ABSTRACT

Integrative analysis, or what is coming to be known as
meta-analysis, is the integration of the findings of many empirical
research studies of a topic. The methods of meta-analysis are
merely statistical methods, suitably adapted in many instances, that
are applicable to the job of integrating findings from many studies.
A meta-analysis involves about a half-dozen steps: (1) defining the
problem, (2) finding the research studies, (3) coding the study
characteristics, (4) measuring the study findings on a common scale,
and (5) analyzing the aggregation of findings and their relationship
to the characteristics. The thinking and research reported here is
recorded in roughly the same order. The report encompasses general
background on the approach taken in a meta-analysis, numerous
illustrations of the approach, and the results of some original
research on statistical methods used in meta-analysis. The report
can be read in at least three ways: as a textbook of methods of
integrative analysis, as a record of some new ideas about
integrative analysis, or as an apologia for meta-analysis.
(Author/BM)
"Ben-Adhem picked up a stone from beside the road. It had written
on it, 'Turn me over and read.' So he picked it up and looked at
the other side. And there was written, 'Why do you seek more
knowledge when you pay no heed to what you know already?'"

                                        Shah (1963, p. 110)

"Science is built up with facts, as a house is with stones. But a
collection of facts is no more a science than a heap of stones is
a house."

                    Poincaré (in La Science et l'Hypothèse)

ABSTRACT

Integrative analysis, or what is coming to be known as
meta-analysis, is the integration of the findings of many empirical
research studies of a topic. For example, it might be undertaken to
summarize the findings of fifty experiments on the effects of
amphetamines on hyperactive pupils. Meta-analysis differs from
traditional narrative forms of research reviewing in that it is more
quantitative and statistical. Thus, the methods of meta-analysis are
merely statistical methods, suitably adapted in many instances, that
are applicable to the job of integrating findings from many studies.

A meta-analysis involves about a half-dozen steps: 1) defining the
problem, 2) finding the research studies, 3) coding the study
characteristics, 4) measuring the study findings on a common scale,
and 5) analyzing the aggregation of findings and their relationship
to the characteristics. The thinking and research reported here is
recorded in roughly the same order. The report encompasses general
background on the approach taken in a meta-analysis, numerous
illustrations of the approach, and the results of some original
research on statistical methods used in meta-analysis. The report
can be read in at least three ways: as a textbook of methods of
integrative analysis, as a record of some new ideas about
integrative analysis, or as an apologia for meta-analysis.
This may be precisely the right time to write this book or
precisely the wrong time. The reader ought not to assume that since
this book lies before him that we came eventually to believe that
the former was true. For we may have persevered in the face of
ambivalence and written in spite of our doubts. Or we may have
willfully written a book knowing that its time was not right. In
fact, we wrote this book from necessity. Much of the work on which
it is based has been supported for the past two years by a grant
from the National Institute of Education. We are obliged now to file
with the Institute some reasonable record of our efforts and their
fruits. Propitious or not, this book will be written in this moment
and not some later one. The reader who has ever struggled with
writing a book will understand when we say that we now have that
feeling that if we don't write it now, it will never get written.

Our subject is the methods of integrating empirical research. The
problems we address lie at the center of a tiny revolution in the
way social scientists and researchers attempt to extract knowledge
from empirical inquiry and communicate it. The revolution was
spawned by necessity. The findings of empirical research grew
exponentially in the middle fifty years of the 20th century.
Evidence, even the organized, analyzed and codified evidence of the
archival journals, multiplied beyond the ability of the unaided
human mind to process it.
CHAPTER ONE

THE PROBLEMS OF RESEARCH REVIEW AND INTEGRATION

The mathematician David Hilbert once said that the importance of a
scientific work can be measured by the number of previous
publications it makes superfluous to read. There is a hint of
grouchiness and despair in Hilbert's complaint that scholars in all
fields increasingly feel. What is one to make of the cornucopia of
research literature? Can one make anything of it, or does one
inevitably founder in the riches of empirical inquiry and sink to
obsolescence?

The house of social science research is sadly dilapidated. It is
strewn among the scree of a hundred journals and lies about in the
unsightly rubble of a million dissertations. Even if it cannot be
built into a science, the rubble ought to be sifted and culled for
whatever good there is in it.

Maccoby and Jacklin's (1974) review of research on the psychology
of sex differences encompassed 1,600 works published before 1973.
If one considers the literature on that topic since 1973 and
realizes that many studies not focused specifically on sex
differences may contain data on the question, then an estimated
population of over 5,000 studies can be imagined. Dozens of
educational problems could be named on which the available research
literature numbers several hundred articles: ability grouping,
reading instruction, programmed learning, instructional television,
integration, etc. When Miller (see Smith, Glass and Miller, 1980)
set out to determine the effects of drug therapy on psychological
disorders, he found published reports of clinical experiments in
such abundance (numbering literally thousands of studies) that he
was forced to impose a sampling frame on the immense body of
literature and take a survey sample of experiments. Social and
behavioral research is a large and widely scattered enterprise. On
problems of importance, it produces literally hundreds of studies in
less than five years. The research techniques used, the measurements
taken, the types of person studied: each may vary in bewildering
irregularity from one study to the next even though the topic is the
same. The research enterprise in education and the social sciences
is a rough-hewn, variegated undertaking of huge proportions.
Determining what knowledge this enterprise has produced on some
question is, itself, a genuinely important scholarly endeavor.

The style of research integration has been shaped by the size of
the research literature. In the 1940's and '50's, a contributor to
the Review of Educational Research or Psychological Bulletin might
find one or two dozen studies on a topic. A narrative, rhetorical
integration of so few studies was probably satisfactory. By the late
1960's, the research literature had swollen to gigantic proportions.
Although scholars continued to integrate studies narratively, it was
becoming clear that chronologically arranged verbal descriptions of
research failed to portray the accumulated knowledge. Reviewers
began to make crude classifications and measurements of the
conditions and results of studies. Typically, studies were
classified in contingency tables by type and by whether outcomes
reached statistical significance. Integrating the research
literature of the 1970's demands more sophisticated techniques of
measurement and statistical analysis. The accumulated findings of
dozens or even hundreds of studies should be regarded as complex
data points, no more comprehensible without the full use of
statistical analysis than hundreds of data points in a single study
could be so casually understood. Contemporary research reviewing
ought to be undertaken in a style that is as technical and
statistical as it is narrative and rhetorical. Toward this end, we
suggested a name to make the needed approach distinctive. The
desired approach was earlier referred to as the meta-analysis of
research (Glass, 1976). We have no stake in the use of this term; it
sounds pretentious, but is only incidentally so. It was chosen to
suggest the analysis of analyses, i.e., the statistical analysis of
the findings of many individual analyses. The term integrative
analysis might serve as well, but meta-analysis has entered common
parlance among some researchers fairly quickly and may become
conventional. Secondary analysis is imprecise to the point of being
misleading and should not be used interchangeably with these terms;
it connotes an altogether different activity (Cook, 1974). Where a
modification is needed to distinguish the meta-analysis of a body of
studies from each of the studies individually, primary research can
be used to denote the latter.
Researchers have apparently thought little about the
methodological and technical problems of research integration. Light
and Smith (1971) first gave serious attention to these problems.
Their paper is a careful treatment of the inadequacies of simple
methods of research integration. Their proposed solution, the
cluster approach, is in the spirit of the solution recommended here,
but it is more conservative: ". . . little headway can be made by
pooling the words in the conclusions of a set of studies. Rather,
progress will only come when we are able to pool, in a systematic
manner, the original data from the studies." (Light and Smith, 1971,
p. 443.) This assumption and the methods based on it probably
discard far too many informative studies for which the data are no
longer available, though the summary findings remain.

Gregg Jackson (1978), a sociologist, conducted what is perhaps the
finest study yet of the practices and methods of research reviewers
and synthesizers in the social sciences. He sampled at random 36
integrative reviews from the leading journals in education,
psychology and sociology. The various features of method of each
review were coded according to the categories of an extensive coding
form that Jackson created. His conclusions:

a) Reviewers frequently fail to examine critically the evidence,
   methods and conclusions of previous reviews on the same or
   similar topics. (Although 75 percent of the reviewers cited
   previous reviews, only 6 percent examined them critically.)

b) Reviewers often focus their discussion and analysis on only a
   part of the full set of studies they find, and the subset
   examined is seldom a representative sample nor is it clear how
   it (the subset) was chosen. (Only 3 percent of the reviewers
   appeared to have used existing indexes, e.g., ERIC, in their
   search; only 22 percent selected a fair sample of studies in
   the judgment of Jackson's coders; and only 3 percent analyzed
   the full set of studies found.)

c) Reviewers frequently use crude and misleading representations
   of the findings of the studies. (About 15 percent of the
   reviewers classified studies according to whether their
   findings were "statistically significant," a practice which
   will be criticized in Chapter 5; frequently, reviewers report
   test-statistics (t, F, etc.) for one or more studies.)

d) Reviewers sometimes fail to recognize that random sampling
   error can play a part in creating variable findings among
   studies.

e) Reviewers frequently fail to assess systematically possible
   relationships between the characteristics of the studies and
   the study findings. (Fewer than 10 percent of the reviewers
   studied whether the findings of the research were mediated by
   characteristics of the persons studied, the study context, the
   nature of the experimental intervention or the characteristics
   of the research design.) The lack of systematic examination of
   these relationships is important because reviewers frequently
   eliminate studies from consideration because of a priori
   judgments that their findings are flawed by one or another
   study characteristic.

f) Reviewers usually report so little about their methods of
   reviewing that the reader cannot judge the validity of the
   conclusions.

Jackson also surveyed a small group of fewer than a dozen editors
of review journals and executives of social science organizations in
an attempt to determine which practices and standards prevail in
their reviewing and integrating activities. He concluded that this
survey was unproductive, but it was only unproductive of an
articulated set of procedures and methods of study review and
integration for the simple reason that such apparently do not exist.
Jackson's small survey revealed clearly that the conception of
research review and integration that prevails in the social and
behavioral sciences is one in which the activity is viewed as a
matter of largely private judgment, individual creativity and
personal style. Indeed, it is and ought to be all of these to some
degree; but if it is nothing but these it is curiously inconsistent
with the activity (viz., scientific research) it purports to
illuminate.

Jackson (1978) went on in Chapter Six of his report to give a
valuable list of guidelines for integrative reviewing that encompass
such aspects of the process as selecting the topic, sampling
studies, coding the characteristics of studies, analyzing the data
and interpreting the results. (Not coincidentally, guidelines for
performing a primary research study could well be classified under
the same headings.) Jackson devoted Chapter Four, "A New
Alternative: Meta-Analysis," of his report to a description and
critique of the approach that is the subject of this book.

Under the pressure of burgeoning research literatures, old and
informal narrative techniques of research review and integration are
breaking down. The fundamental problem is one of the mind's
limitations and the magnitude of the task to which it is applied.
The reviewer is even less able to absorb the sense of one hundred
research studies than is an observer able to scan one hundred test
scores and, without reliance on statistical methods, absorb the
sense of their size and spread and correlations. Cooper and
Rosenthal (1980) recently conducted an experiment in integrating
research findings that illustrated these points. About forty persons
(graduate students or more experienced) were randomly split into two
groups. Subjects in both groups were given seven empirical studies
on "sex differences in persistence" to review. Subjects in Group A
were told:

     "Before drawing any final conclusions about the overall
     results of persistence studies, please take a moment to review
     each individual study. In generating a single conclusion from
     the independent studies, employ whatever criteria you would
     use if this exercise were being undertaken for a class term
     paper or manuscript for publication."




Thus, Group A employed traditional, narrative techniques of
integrating the findings of the seven studies. By contrast, Group B
was instructed as follows:

     "Before drawing any final conclusions about the overall
     results of persistence studies, you are asked to perform a
     simple statistical procedure. The procedure is a way of
     combining the probabilities of independent studies. The
     purpose of the procedure is to generate a single probability
     level which relates to the likelihood of obtaining a set of
     studies displaying the observed results. This probability is
     interpreted just like that associated with a t- or
     F-statistic. For example, assume the procedure produces a
     probability of .04. This would mean there are 4 chances in 100
     that a set of studies showing these results were produced by
     chance. The procedure is called the Unweighted Stouffer
     method, and requires that you do the following:

     1) Transfer the probabilities recorded earlier from each study
     to Column 1 of the Summary Sheet. [A summary sheet was
     provided each subject. The sheet contained the titles of the
     seven articles and columns for performing each step in the
     procedure.]

     2) Since we are testing the hypothesis that females are more
     persistent than males, divide each probability in half (a
     probability of 1 becomes .5). If a study found men more
     persistent, attach a minus sign to its probability. Place
     these numbers in Column 2. [It had been determined beforehand
     that only two-tailed probabilities were reported.]

     3) Use the Normal Deviations Table provided below and
     transform each probability in Column 2 into its associated
     Z-score. Place these values (with sign) in Column 3. If the
     probability is .5, the associated Z-score is zero (0).

     4) Add the Z-scores in Column 3, keeping track of algebraic
     sign. Place this value at the bottom of Column 3.

     5) Divide this number by the square root of the number of
     studies involved. In this case, because N = 7, this number is
     2.65. Thus, divide the sum of the Z-scores by 2.65. Place this
     number in the space below.

          Z-SCORE FOR REVIEW ________

     6) Return to the Normal Deviations Table and identify the
     probability value associated with the Z-score for review.
     Place this number in the space below.

          P-VALUE FOR REVIEW ________

     This probability tells how likely it is that a set of studies
     with these results could have been produced if there really
     were no relation between gender and persistence. The smaller
     the probability, the more likely it is that females and males
     differ in persistence, based on these studies." (Cooper and
     Rosenthal, 1980, p. 445.)
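
The six steps quoted above can be sketched in a few lines of Python.
This is only an illustrative sketch, not Cooper and Rosenthal's
materials: the function name `stouffer_unweighted`, its two
arguments, and the use of the standard library's
`statistics.NormalDist` in place of the printed Normal Deviations
Table are assumptions made here, and the p-values in the usage
example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_unweighted(two_tailed_ps, favors_hypothesis):
    """Combine independent two-tailed p-values (unweighted Stouffer method).

    two_tailed_ps     -- one reported two-tailed probability per study
    favors_hypothesis -- True if the study's result is in the hypothesized
                         direction (here: females more persistent)
    Returns (z_for_review, p_for_review).
    """
    if len(two_tailed_ps) != len(favors_hypothesis) or not two_tailed_ps:
        raise ValueError("need one direction flag per p-value")
    std_normal = NormalDist()  # stands in for the Normal Deviations Table
    z_scores = []
    for p, favors in zip(two_tailed_ps, favors_hypothesis):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        one_tailed = p / 2                      # step 2: halve the probability
        z = std_normal.inv_cdf(1 - one_tailed)  # step 3: p = .5 gives z = 0
        z_scores.append(z if favors else -z)    # sign follows the direction
    z_review = sum(z_scores) / sqrt(len(z_scores))  # steps 4 and 5
    p_review = 1 - std_normal.cdf(z_review)         # step 6 (one-tailed)
    return z_review, p_review


if __name__ == "__main__":
    # Hypothetical p-values for seven studies; two run against the hypothesis.
    ps = [0.04, 0.20, 0.50, 0.08, 0.90, 0.30, 0.01]
    directions = [True, True, False, True, False, True, True]
    z, p = stouffer_unweighted(ps, directions)
    print(f"Z-score for review: {z:.2f}   p-value for review: {p:.3f}")
```

With seven studies the divisor in step 5 is the square root of 7,
about 2.65, which is the figure given in the instructions.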

Subjects in both Groups A and B rated their opinion of the
strength of support for a conclusion of a relationship between sex
and persistence in the seven studies. In fact, the combined results
from the seven studies supported rejection of the null hypothesis of
no difference at beyond the .02 level. The following frequencies
were obtained:



                        Group A                 Group B
                   Traditional Methods     Statistical Methods
Opinion                 of Review               of Review
(Is there a
relationship?)        No.       %             No.       %

Definitely No           3      14%             1        5%
Probably No            13      59              5       26
Impossible to Say       5      23              8       42
Probably Yes            1       5              5       26
Definitely Yes          0       0              0        0
                              100%                    100%

The results are remarkable. Nearly 75 percent of the reviewers who
relied on traditional narrative methods concluded that sex and
persistence were not related; the comparable figure among the group
using statistical methods of review was 31 percent, rather
strikingly different conclusions for equivalent groups trying to
integrate only seven studies.

An issue of nearly equal importance concerns the magnitude of the
relationship that the seven studies revealed. Again the reviewers in
both groups were asked to rate their perception of the strength of
the relationship.
                        Group A                 Group B
                   Traditional Methods     Statistical Methods
Opinion                 of Review               of Review
(How large is the
sex difference in
persistence?)         No.       %             No.       %

None at all             4      18%             2       11%
Very small             12      55              6       32
Small                   4      18              6       32
Moderate                2       9              4       21
Large                   0       0              1        5
                              100%                    100%
The above data repeat the general findings apparent in the
previous table: persons using the two different methods of research
integration formed quite different impressions about what the
studies indicated. Cooper and Rosenthal examined these processes on
a small collection of studies; the entire set of seven studies
occupied a total of fewer than fifty journal pages. One can imagine
how much more pronounced would be the difference between these two
approaches with bodies of literature typical of the size of
literatures that are increasingly being addressed with
meta-analytic techniques. This difference will become more apparent
to the reader who wends his way through the complex examples of
research integration in the remainder of this book.

Consider another example of the contrasting conclusions arrived at
through contrasting methods of review and integration. In a review
of experiments on the effects of teachers' use of higher cognitive
questions on students' achievement, Winne (1979) concluded that the
former had no beneficial impact on the latter. A meta-analysis of
virtually the same studies by Redfield and Rousseau (1980) revealed
that on the average, students given higher cognitive level questions
scored one-half standard deviation higher on achievement tests.
Thus, informal and narrative techniques of review and integration
discredited a finding that quantitative methods of integration
showed to be consistent and large.
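
A difference of "one-half standard deviation" is a mean difference
divided by a standard deviation. The sketch below shows one way such
a figure could be computed for a single study. It assumes that the
control group's standard deviation serves as the yardstick, which
this chapter does not specify; the function name and the scores are
hypothetical.

```python
from statistics import mean, stdev


def standardized_mean_difference(treatment_scores, control_scores):
    """Treatment-minus-control mean difference in control-group SD units.

    Assumption: the control group's sample standard deviation is the
    standardizer; other choices (e.g. a pooled SD) are equally possible.
    """
    if len(control_scores) < 2:
        raise ValueError("need at least two control scores for an SD")
    difference = mean(treatment_scores) - mean(control_scores)
    return difference / stdev(control_scores)


if __name__ == "__main__":
    # Hypothetical achievement scores for one small experiment.
    higher_cognitive = [45, 55, 65]
    comparison = [40, 50, 60]
    print(standardized_mean_difference(higher_cognitive, comparison))
```

Putting every study's finding on this kind of common scale is what
allows findings from different tests and samples to be averaged and
compared.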

Narrative research reviews often make no attempt at rigorous
definition and standardization of techniques for treating studies.
Hence, impressions are subject to prejudice and stereotyping to a
degree that would be unforgivable in primary research itself.
Consider an instance encountered by Miller (1977) in his
meta-analysis of experiments on the psychological benefits of drug
therapy. At one point, attention focused on the question whether the
combination of verbal psychotherapy and drug therapy was superior to
the drug therapy alone. Three different traditional reviews
completed within about five years of each other and based on largely
the same literature arrived at the following conclusions:

     "The advantage for combined treatment is striking. . . a
     combination of treatments may represent more than an additive
     effect of two treatments -- a 'getting more for one's money'
     -- there may also be some mutually facilitative interaction
     benefits for the combined treatments." (Luborsky, et al.,
     1975, p. 1004).

     ". . . There is little difference between psychotherapy plus
     drug and drug therapy alone for hospitalized psychotic
     patients (but not for neurotic out-patients). The combination
     is, however, quite clearly superior to psychotherapy alone."
     (May, 1971, p. 513)

     "When all is said and done, the existing studies by no means
     permit firm conclusions as to the nature of the interaction
     between combined psychotherapy and medication." (Uhlenhuth,
     Lipman, & Covi, 1969, p. 611).

The disparity among these reviewers is not limited to their
conclusions but extends even to their classification of individual
experiments. Miller (1977) found five reviews (the three quoted
above and two others) addressed specifically to the
"psychotherapy-plus-drug" versus "drug therapy" issue. In Table 1.1,
the reviews, the studies reviewed and how they were classified are
reported. Notice, for example, that Luborsky et al. (1975)
classified the Gorham study, the Cowden study and the King study as
finding that "drug-plus-psychotherapy" was superior to "drug
therapy" alone, whereas both Uhlenhuth (1969) and May (1971) in
their reviews classified the same studies as showing no difference
or a difference in the reverse order.

Obviously, different reviewers sometimes see things differently.
The only way to force all reviewers to see the same thing is to
demand a standardization of definitions and techniques of research
integration. We don't suggest such; indeed, it would be ill-advised,
since the little "reliability" that would be gained would probably
be more than offset by the creativity that would be staunched by
uniformity.

It is not uniformity in research reviewing and integrating that is
desirable; rather it is clarity, explicitness and openness, those
properties that are characteristic of the scientific method more
generally and which impart to inquiry its "objectivity" and
trustworthiness.
Table 1.1

Summary of Findings of Five Reviews Comparing Drug
Plus Psychotherapy with Drug Therapy
(After Miller, 1977)

Reviewer              D + P > D          D > D + P or D + P = D


> Group fo^ rhe 
Advancement of 
Psychiatry 

\ ^ (75) ' 

t 


\ :<ing (58) 

' Evangelxkas (51) 
! r^err^an (7w) 

Honizfeld (ftL) 


1 May C6A) 

Cowden (55,56) 

i 

j 

I 


I Gilligan (65) 


Evangelikas (61) 


Cowden (56) 


i . ^ 
j 

i 
I 

! 

1 Uhlenhuth (69) 

> 

1 ? 


King (58) 
Evangelikas (61) 

■ / - 


1 

Cowden (56,jj^7) 
King (63) 
Konigfeld -(6^) 

Go r ham (/S^y ' i 
May (6A) 



'luborsky (75) 



Gorham (5^ ) 

Ilogarty (73) 

i 

' Cowden (56) 
KingY63) 
Luborsky (5A ) 
Klennan (7^) 



King (60) 
May (65) 
Pascal (56) 
Evangelikas (61) 
Kroeger (67) 



r 



May (71) 
Gorham (6^)



King (63, 58) , 
Gorham (64) 
Cowden (56) 
May (6A) 

EvaAgplikas (61) 
Lorr (62) 



r 



It is often said of experimental research that it must be
replicable to be scientific. Surely the true test of whether a
finding is replicable is to replicate it; but as is observed ad
nauseam, studies never actually are replicated. Hence, the
scientific attitude in research cannot truly depend on
replicability. Indeed, if one inquires more deeply into the
question, one discovers that it is not replicability that is
desirable in a scientific study, but the description of a study so
that it could in theory be replicated, i.e., so that if one desired
he could perform the same steps that led to the prior observations.
Hence, to report a study so that it is "replicable" means to report
it with such clarity and explicitness that a second investigator
could follow the identical steps to the identical conclusion.
Thereby, science is guaranteed to be "intersubjective" rather than
an endeavor subject to the whims and idiosyncrasies of individual
researchers. These values and standards are ingrained in the
contemporary scientist's training; but too often he forgets his
responsibility to the scientific method when he changes context
slightly and seeks to integrate numerous empirical studies instead
of performing a single primary study. Thus do reviews become
idiosyncratic, authoritarian, subjective, all those things that cut
against the scientific grain.

The important point about the example in Table 1.1 is not that
Uhlenhuth, Luborsky and May disagreed, but that they did not
approach the problem of research integration with methods so
explicit, unambiguous and operationally identified that any outside
party could examine the same evidence and come to the same
conclusion. By contrast, Miller (1977) approached the same research
integration problem (viz., "drug-plus-psychotherapy" vs. "drug
therapy") with an attitude like that of a researcher collecting and
analyzing primary data: concepts must be defined and measured,
measurements must be checked for reliability, evidence must not be
excluded on arbitrary or ad hoc grounds, multiple observations
inform on residual error, statistical methods are an important
adjunct to raw perception. He found that the combined effect of drug
and psychotherapy was approximately three-tenths standard deviation
(on outcome measures of psychological well-being) greater than the
isolated effect of drug therapy (see Chapter 8 in Smith, Glass and
Miller, 1980).
CHAPTER TWO

META-ANALYSIS OF RESEARCH

Primary analysis is the original analysis of data in a research
study. It is what one typically imagines as the application of
statistical methods.

Secondary analysis is the re-analysis of data for the purpose of
answering the original research question with better statistical
techniques, or answering new questions with old data. Secondary
analysis is an important feature of the research and evaluation
enterprise. Tom Cook (1974) at Northwestern University has written
about its purposes and methods. Some of our best methodologists have
pursued secondary analyses in such grand style that its importance
has eclipsed that of the primary analysis.

But our topic is what we have come to call, not for want of a less
imposing name, meta-analysis of research. In 1976, one of us defined
it thus:

     "Meta-analysis refers to the analysis of analyses. I use it to
     refer to the statistical analysis of a large collection of
     analysis results from individual studies for the purpose of
     integrating the findings. It connotes a rigorous alternative
     to the casual, narrative discussions of research studies which
     typify our attempts to make sense of the rapidly expanding
     research literature." (Glass, 1976, p. 3)

And again, two years later:

     "The accumulated findings of dozens or even hundreds of
     studies should be regarded as complex data points, no more
     comprehensible without the full use of statistical analysis
     than hundreds of data points in a single study could be so
     casually understood. Contemporary research reviewing ought to
     be undertaken in a style more technical and statistical than
     narrative and rhetorical. Toward this end, I have suggested a
     name to make the needed approach distinctive; I referred to
     this approach as the meta-analysis of research (Glass, 1976).
     I have no stake in the use of this term; it sounds
     pretentious, but is only incidentally so. It was chosen to
     suggest the analysis of analyses, i.e., the statistical
     analysis of the findings of many individual analyses."
     (Glass, 1978, p. 352).

And two years later still:

     "The approach to research integration referred to as
     'meta-analysis' is nothing more than the attitude of data
     analysis applied to quantitative summaries of individual
     experiments. By recording the properties of studies and their
     findings in quantitative terms, the meta-analysis of research
     invites one who would integrate numerous and diverse findings
     to apply the full power of statistical methods to the task.
     Thus it is not a technique; rather it is a perspective that
     uses many techniques of measurement and statistical analysis."
     (Glass, 1980, p. 2).

The essential character of meta-analysis is that it is the
statistical analysis of the summary findings of many empirical
studies.

Meta-Analysis Is Quantitative

Meta-analysis is quantitative. It is undeniably quantitative; and
by and large it uses numbers and statistical methods in a practical
way, namely, for organizing and extracting information from large
masses of data that are nearly incomprehensible by other means.
Numerosity creates many of the problems of research synthesis;
naturally, numerical methods are employed in their solution.

Meta-Analysis Does Not Prejudge Research Findings in Terms of Research
Quality

The findings of studies are not judged a priori or by arbitrary 
and non-empirical criteria of research quality. In this respect, 
meta-analysis differs greatly from other approaches to research
integration. Typical narrative reviews attempt to deal with multiplicity
by arbitrary exclusion. The dissertation literature is
excluded because it may be believed that any worthwhile study would
have been published. Huge numbers of studies are excluded on methodological
grounds: poor design, bad measurement, badly implemented
treatment, and the like. Yet, evidence is never given to support
these arbitrary exclusions.

An important part of every meta-analysis with which we have
been associated has been the recording of methodological weaknesses
in the original studies and the examination of their relationship to
study findings. Thus, the influence of study quality on findings
has been regarded as an empirical a posteriori question, not an
a priori matter of opinion or judgment used to exclude large numbers
of studies from consideration.

Meta-Analysis Seeks General Conclusions

The most common criticism of meta-analysis is that it is
illogical because it mixes findings from studies that are not the same;
it mixes apples and oranges. Implicit in this concern is the belief
that only studies that are the same in certain respects can be aggregated.
The claim that only studies which are the same in all respects can
be compared is self-contradictory; there is no need to compare them
since they would obviously have the same findings within statistical
error. The only studies which need to be synthesized or integrated are
different studies. Generalizations will necessarily entail ignoring
some distinctions that can be made among studies. Good generalizations
will be arrived at by ignoring only those distinctions that make
no important difference. But ignore we must; knowledge itself is
possible only through the orderly discarding of information.

Yet it is intuitively clear that some differences among
studies are so large or critical that no one is interested in their
integration. What, for example, is to be made of study #1 which
demonstrates the effectiveness of disulfiram in the treatment of
alcoholism and study #2 which demonstrates the benefits of motorcycle
helmet laws? Not much, I suppose. But it hardly follows that the
integration of study #1 on lysergide treatment of alcoholism and study
#2 on "controlled drinking" is meaningless; one is understandably
concerned with which treatment has a greater cure rate. Is the
essential difference between the two examples that in the former
case the problems addressed by the studies are different but the
problem is the same in the latter example? "Problem" is no better
defined than "study" or "findings," and invoking the word clarifies
little. It is easy to imagine the Secretary for Health comparing fifty
studies on alcoholism treatment with fifty studies on drug addiction
treatment or a hundred studies on the treatment of obesity. If the
two former groups of studies are negative and the latter is positive,
the Secretary may decide to fund only obesity treatment centers. From
the Secretary's point of view, the problem is public health, not simply
alcoholism or drug addiction treatment.

There exists another respect in which it is inconsistent to
criticize meta-analysis as meaningless because it mixes apples and oranges.
Data analyses of primary research are traditionally performed by
lumping together (averaging or otherwise aggregating in analyses of
variance, t-tests and whatever) data from different persons. These
persons are as different and as much like apples and oranges in their
way as studies are different from each other. Yet to object to
pooling the findings of studies 1, 2, . . ., 10 and see nothing at all
objectionable in pooling the results from persons 1, 2, . . ., 100
is inconsistent. Now one might think that the two kinds of aggregating
identified are qualitatively different; but it would remain to be
specified exactly how they are different and why it matters, which
would necessarily entail presenting empirical evidence to demonstrate
that studies using different populations, measuring instruments,
data analyses, etc. are fundamentally incommensurable. The ironic
dilemma posed here is that such an empirical demonstration would be
of itself an analysis of exactly the type which we have referred to as
a "meta-analysis."

Meta-analysis is aimed at generalization and practical simplicity.
It aims to derive a useful generalization that does not do violence
to a more useful contingent or interactive conclusion. The world runs
on generalizations and marginal utilities. They represent synthesis;
science runs on analysis. Therein lie many of the difficulties that
scientists and men of practical affairs encounter when they meet.

Our approach, meta-analysis, has been misunderstood, a
circumstance for which we must accept that share of the responsibility
due us. It has been characterized by some as "averaging effect sizes,"
which is a little like characterizing analysis of variance as "adding
and multiplying." The sine qua non of what we call meta-analysis is the
application of research methods to the characteristics and findings of
research studies. By "research methods" is meant such considerations
as are normally addressed in conceptualizing, designing and analyzing
empirical research: problem selection, hypothesis formulation,
definition and measurement of constructs and variables, sampling, data
analysis (see Kerlinger, 1964, or many others).

The methods of meta-analysis have much in common with those of
survey research, for in fact, research review and integration is a
process of surveying and analyzing in quantitative ways large
collectivities. Many of the issues faced in a meta-analysis are akin to the
problems addressed in survey design and analysis (cf. Kish, 1965). The
similarity between the two should not be taken as implying that
meta-analysis shares with survey research the latter's limitations as
regards the analysis of causal claims. Survey research continues to
struggle with the problems of unknown third variables and ambiguous
direction of causality. Meta-analysis, on the other hand, through no
great accomplishment of its own, may very well be applied to the
findings of a literature of controlled experimental studies, each of
which has a valid claim on a causal conclusion.

We do not wish to imply that a clear break can be discerned
between earlier methods of research integration and meta-analysis. In
fact, under the pressure of numbers, research reviewers have gradually
of necessity adopted increasingly rigorous and quantitative methods
of study integration in the past thirty years. For example, Underwood
(1957) found 16 experiments on the link between memory and interference
when he attempted to integrate the existing research. The standard
designs and the near standard measurements common to the studies suggested
a more quantitative amalgamation of the evidence than was typical in
research reviewing at the time. By graphing the number of lists of
items to be recalled in these experiments against the percent correct
recall on the last list, Underwood obtained an orderly and convincing
pattern describing the relationship (see Figure 2.1). By portraying
multiple findings quantitatively and aggregating across some potentially
irrelevant distinctions (e.g., lists of geometric forms vs. nonsense
syllables; paired-associate vs. serial presentation; long lists vs.
short lists), Underwood discovered a convincing and important finding
not apparent in the disparate constituent studies. This is the essence
of the meta-analysis approach.

Readers of review journals (e.g., Psychological Bulletin,
Review of Educational Research, American Sociological Review) have
become familiar with increasingly more elaborate forms of research
integration. Long lists of coded descriptions of research literatures
have become common. Contingency tables showing proportions of
"significant results" under various conditions are more and more
a standard feature of integrative reviews. These developments were
required by the complexity of the reviewing task, and they are in the
spirit of the methods we present here. We hope to have advanced
these methods by appropriately increasing the quantification and
analysis of the task so that the full value of modern statistical
methods is realized.

Rosenthal (1976) integrated the findings of several hundred
studies of the experimenter expectancy effect in behavioral research.
The techniques he used and his discussion of methodology were remarkably
like those presented in Glass (1976), though the two efforts
(born of similar necessities) proceeded independently. In the five
years since our work has been publicized, the methods developed and
recommended have been applied repeatedly and in diverse areas:
treatment of stuttering (Andrews, 1979), modern vs. traditional math
instruction (Athappilly, 1980), "process oriented" science instruction
(Bredderman, 1979), mainstreaming of special education students (Carlberg,
1979), neuropsychological assessment of children (Davidson, 1978),
"inquiry oriented" science teaching (El-Nemr, 1979), transcendental
meditation (Ferguson, 1980), teaching style and pupil achievement
(Glass, 1977; Gage, 1978), social-psychological environments
and learning (Haertel, Walberg and Haertel, 1979), sex differences
in decoding verbal cues (Hall, 1978), individualized mathematics
instruction (Hartley, 1977), effects of television on social behavior
(Hearold, 1979), validity of employment tests (Hunter, Schmidt, and
Hunter, 1979), home environment and learning (Iverson and Walberg,
1979), psycholinguistic training (Kavale, 1979), treatment of
hyperactivity (Kavale, 1980), racial desegregation and academic
achievement (Krol, 1979), personalized college-level instruction
(Kulik, Kulik and Cohen, 1979), advance organizers (Luiten, Ames
and Ackerson, 1979), drug therapy and psychological disorders (Miller,
1977; and Smith, Glass and Miller, 1980), test validity in personnel
selection (Pearlman, 1979), teachers' questioning style (Redfield and
Rousseau, 1979), psychotherapy and medical utilization (Schlesinger,
Mumford and Glass, 1978), psychotherapy and recovery from medical
crisis (Schlesinger, Mumford and Glass, 1979), aesthetics education
and basic skills (Smith, 1980), sex-bias in counseling and psychotherapy
(Smith, 1980), class-size and affective outcomes (Smith
and Glass, 1979), psychotherapy outcomes (Smith and Glass, 1977),
motivation and achievement (Uguroglu and Walberg, 1978), socio-economic
status and academic achievement (1976), relationship
between attitude and achievement (Willson, 1980), patient education
programs in medicine (Posavac, 1980), correlation of auditory
perceptual skill and reading (Kavale, 1980), diagnostic/remedial
instruction and science learning (Yeany and Miller, 1980), treatment
of migraine and tension headache (Blanchard, Andrasik, Ahles, Teders
and O'Keefe, 1980), effects of direct versus open instruction (Peterson,
1978).

Illustrations of Meta-Analysis

Meta-analysis has been misunderstood and criticized, the
criticisms often gathering their force from the misunderstandings.
But the objections raised to meta-analysis are the subject of the
final chapter. In the remainder of this chapter, we wish instead to
elaborate on the verbal characterization of meta-analysis by describing
briefly several applications of the method.

Psychotherapy and Asthma. Thirteen studies were located that
tested the effects of psychotherapy on asthma. Eleven studies used
treatment and control group designs; two designs were pretest versus
posttest.

The summary of the data and findings appears as Table 2.1,
which offers the following items of information about each study:
a) author(s); b) type of therapy; c) average age of subjects;
d) number of hours of therapy given; e) the nature of the control group
(no treatment, relaxation therapy, medical treatment); f) the number
of weeks elapsing between the end of therapy and measurement of the
outcome variable; g) the nature of the dependent (outcome) variable;
h) the effect size (ES) achieved in the study, the treatment mean minus
the control mean divided by the control group standard deviation,
viz.,

$$ES = \frac{\bar{X}_{\text{therapy}} - \bar{X}_{\text{control}}}{s_{\text{control}}}$$
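The effect size defined in item h) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `effect_size` is ours, the scores are invented to exercise the formula, and the sample (n - 1) standard deviation is assumed because the text does not say which form was used.

```python
from statistics import mean, stdev


def effect_size(treatment, control):
    """Treatment mean minus control mean, in control-group SD units.

    Assumption: the control-group standard deviation is the sample
    (n - 1) form; the report does not specify this.
    """
    return (mean(treatment) - mean(control)) / stdev(control)


# Hypothetical scores, invented purely to exercise the formula.
therapy_scores = [14.0, 16.0, 18.0, 20.0]
control_scores = [10.0, 12.0, 14.0, 16.0]

es = effect_size(therapy_scores, control_scores)
print(f"ES = {es:.2f}")
```

Scaling by the control group alone keeps the unit of measurement unaffected by any change the treatment makes to the spread of scores.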



Table 2.1

Findings of 13 Studies of Psychological Treatment of Asthma

Column headings: Study (a); Therapy Type (b); Age (c); Hours of
Therapy (d); Control Group (e); Follow-up Time (weeks) (f);
Dependent Variable (g); ES (h). Entries are listed column by column,
in the order in which they appear.

Study (a): Moore (1965); Sclare et al.; Yorkston et al.;
    Maher-Loughnan et al. (1962); Citron, K. M. (1968);
    Groen & Pelser (1960); Barendregt (1957); Ago et al. (197[illegible])
Therapy Type (b): Reciprocal inhibition; Psychodynamic; Verbal
    desensitization; Hypnotherapy; Hypnotherapy; Psychodynamic
    (group); Eclectic (+ dynamic); Eclectic (+ somatic + therapy)
Age (c): 1/2 adults, 1/2 children; 30 (19-[illegible]); [illegible]; 25; 30
Hours of Therapy (d): [illegible]; 12; 50; 100; 20
Control Group (e): Relax training; Physical treatment; Relax
    training; No treatment; Relax training; Medical treatment;
    Medical treatment; Medical treatment
Follow-up Time (weeks) (f): 0; 0; 96; 96; 120
Dependent Variable (g): Lung functioning; No. asthma attacks;
    Remission of symptoms; Lung functioning; Psychiatrist's rating
    of improvement; Use of drugs; Symptoms, wheezing; Symptoms,
    wheezing; Rated improvement; Increased hostility, decreased
    "oppression & damage" (Rorschach); Remission of asthma symptoms
ES (h): [illegible]; .88; .66; 1.00; 1.00; 1.32; .64; [illegible];
    .36; .57; 1.51

Table 2.1 (continued)

Study (a): Kahn (1977); Kahn et al. (1973); Alexander et al.;
    McLean, A. F. (1965); Arnoff, G. M. et al.
Therapy Type (b): Counter-conditioning; Counter-conditioning;
    Jacobson relaxation training; Hypnotherapy; Hypnotherapy
Age (c) and Hours of Therapy (d), run together: 12; 15; 11; 12;
    11; 10; 15; 1/2
Control Group (e): No treatment; Medical treatment; No treatment;
    None (pretest vs. posttest); None (pretest vs. posttest)
Follow-up Time (weeks) (f): 32; 32; 32; 40; 40; 40; 12
Dependent Variable (g) with ES (h):
    Use of drugs & medication                       .29
    Hospitalization                                 .19
    Asthma attacks                                  .24
    No. of ER visits                                .76
    Amount of drugs & medication                   1.11
    No. of asthma attacks (one hospitalization
      in control group, none in therapy)            .66
    Pulmonary functioning (peak expiratory flow)    .82
    Wheezing score                                 1.23
    Forced lung capacity                           0.71
    Peak air flow rate                             0.67
    Dyspnea                                        1.25

The overall (i.e., summed across all studies) measure of
impact of psychotherapy on asthma is depicted in Figure 2.2.

Figure 2.2 Average effect of psychotherapy on asthma outcome
measures across 13 studies which included 23 outcome
variables. (The figure marks the average effect of .85σ at the
80th percentile of the control group.)



The average effect comparing therapy and control groups was
.85σ, i.e., the average subject who received psychotherapy
was at .85 standard deviations above the mean of the untreated controls.
(The standard deviation of the 23 effect size measures is σ_ES = .390;
thus, the 95% confidence interval of the true average ES is
.85 ± 1.96(.390)/√23 = (.69, 1.01).) It follows that the average
therapy subject exceeds 80% of the untreated controls on the aggregate
outcome variables.

There were six outcome measures in the thirteen studies
that assessed the use of medical services: use of medicine,
hospitalization, emergency room visits. The average effect size for
these six outcomes was ES = .73. The two summary effect sizes,
.85 for all outcomes and .73 for direct medical services, compare
favorably with the effects of psychotherapy on outcomes such as
fear, anxiety, and self-esteem.
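The confidence interval and the percentile quoted for the asthma studies can be checked with a short calculation. The sketch below uses only the three figures given in the text (.85, .390, and 23); the variable names are ours, and the percentile step assumes the control-group scores are normally distributed.

```python
from math import sqrt
from statistics import NormalDist

mean_es = 0.85    # average effect size across outcome measures
sd_es = 0.390     # standard deviation of the effect sizes
n_outcomes = 23   # number of effect size measures

# Standard error of the mean and a 95% normal-theory interval.
standard_error = sd_es / sqrt(n_outcomes)
half_width = 1.96 * standard_error
ci_low = mean_es - half_width
ci_high = mean_es + half_width

# Assumption: control scores are normal, so a subject 0.85 SD above
# the control mean sits at this fraction of the control distribution.
percentile = NormalDist().cdf(mean_es)

print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Control-group percentile: {percentile:.0%}")
```

The interval treats the 23 outcome measures as independent observations, even though several come from the same study.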

The relationship between the effects of psychotherapy and
some features of the therapy and the patients is examined in Tables
2.2 through 2.5.

Therapy Type: The average effect sizes by type of therapy are
as follows:

Table 2.2
Type of Therapy

                Behavioral   Psychodynamic   Hypnotherapy   Relaxation
n:                  12
Mean ES:           .80           1.03            .84            .82
S.D. of ES:        .42            .41            .79             0

The differences among the effects of different types of
therapy are not large, and in no case do they reach conventional
levels of statistical significance.

Age of Patient: The distribution of patients' ages (averaged
within each study) is as follows:

Table 2.3
Age

              10-15   16-20   21-25   26-30   31-35   36-40   41-45
Frequency:

The linear correlation between age of patients (at the study
level) and ES is +.40, which is reasonably statistically significant
(standard error of r = .21).

Hours of Therapy: The distribution of duration of therapy in
hours for the 13 experiments is as follows:

Table 2.4
Hours of Therapy

            1-5    6-10   11-20   21-50   51-100
Freq.:       9      1      10       2       1       Mean = 21.3
ES:        1.03   1.23    .64    1.01     .57

The linear correlation of "hrs. of therapy" and ES across
the 23 outcome measures is -.15, not significantly different from
zero.

Follow-up Time: The follow-up times for measurement of effects
for the 23 outcome measures were distributed as follows:

Table 2.5
Weeks Post Therapy

              0     12     24     32     40     96    120
Frequency:  [illegible]                                      Mean = 25.9
ES:          .81   1.23   1.36    .30    .84   1.26   1.51

The linear correlation of "weeks post therapy" and ES is
.34, not significantly different from zero at any respectable
significance level.
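The three correlations reported for the asthma studies can be compared with their standard error in a few lines. This is a rough sketch, not the authors' procedure: it assumes the large-sample null standard error 1/sqrt(n - 1) with n = 23 outcome measures, which happens to reproduce the quoted .21, and it treats a ratio of 1.96 as the two-sided 5 percent criterion.

```python
from math import sqrt

n_outcomes = 23

# Assumed large-sample standard error of r when the true correlation is zero.
se_r = 1 / sqrt(n_outcomes - 1)

# Correlations with ES as quoted in the text.
correlations = {
    "age of patients": 0.40,
    "hours of therapy": -0.15,
    "weeks post therapy": 0.34,
}

# Ratio of each correlation to its standard error.
ratios = {name: r / se_r for name, r in correlations.items()}

for name, ratio in ratios.items():
    verdict = "exceeds" if abs(ratio) > 1.96 else "falls short of"
    print(f"{name}: r/SE = {ratio:+.2f}, {verdict} 1.96")
```

On this reckoning the age correlation comes closest to the criterion, which fits its description as "reasonably" significant.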

Psychotherapy (primarily behavioral therapies and hypnotherapy)
shows impressively large effects on ameliorating the effects
of asthma. The effects are even substantial on the reduction of
utilization of direct medical services, showing a reduction in utilization
such that only 23 percent of the therapy subjects used as many
medical services as half the control subjects. It is important to
note, in this regard, that in 5 of the 11 experimental vs. control
group studies, the control group received medical treatment that was
not given to the psychotherapy group.
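The 23 percent figure appears to follow from the medical-services effect size of .73 under a normal model. The sketch below is our reading rather than a calculation shown in the report: it assumes service use is normally distributed with equal spread in both groups, and that the therapy group's mean lies .73 standard deviations below the control group's.

```python
from statistics import NormalDist

medical_services_es = 0.73  # average ES for the six medical-service outcomes

# The control median sits 0.73 SD above the therapy mean, so the share of
# therapy subjects at or beyond it is the upper tail past +0.73, which by
# symmetry equals the lower tail below -0.73.
share_at_or_above_control_median = NormalDist().cdf(-medical_services_es)

print(f"{share_at_or_above_control_median:.0%} of therapy subjects")
```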

Psychotherapy and Alcoholism. In Table 2.6 appear data from
15 experiments on the effects of psychotherapy on alcoholism. In
successive columns appear the following information about each study:

    a) The investigator(s) and year of the study;
    b) The type of therapy administered (e.g., behavioral
       modification, eclectic, psychodynamic);
    c) The number of hours of therapy administered;
    d) The number of months after therapy at which outcomes
       were measured;
    e) A definition of "success" for the outcome measure;
    f) The percentage of "successes" in the therapy group;
    g) The percentage of "successes" in the control group;
    h) The differential success, i.e., f) minus g), above.

Table 2.6
Results of Outcome Studies on
Psychological Treatment of Alcoholism

Each row lists, in order: a) study; b) type of therapy; c) hrs. of
therapy; d) mos. post-therapy for follow-up; e) type of outcome;
f) percent success in therapy; g) percent success in control;
h) difference.

Vogler et al. '70; Beh. Mod.; 15; 8; Not relapsed into alcoholism;
    14/25 = 56; 5/12 = 42; [illegible]
Cadogan '73; Eclectic; 18; 6; Abstinence; 18/20 = 90; 4/20 = 20; 70
Clancy et al. '69; Beh. Mod.; 4; 12; Abstinence;
    [illegible]/25 = 24; [illegible]/17 = 18; 6
Gallant '71; Eclectic; 50; 0; Sobriety; 17/140 = 12;
    [illegible]/170 = 4; 8
Gallant et al.; Psychodynam.; 50; 0; Sobriety; 2/21 = 10; 1/21 = 5; 5
Gallant et al. '68; Psychodynam.; 60; 0; Abstinence or nearly so;
    7/10 = 70; 1/9 = 11; 59
Hunt & Azrin '73; Eclectic; 75; 0; Abstinence; 7/8 = 88;
    [illegible]/8 = 13; 74
McCance & McCance '69 (four rows; outcome "Abstinence or nearly so"
    in each): therapy and hours are Psychodynam., 12; Psychodynam., 12;
    Beh. Mod., 6; Beh. Mod., [illegible]. Legible follow-up months:
    6; 12; 12. Legible therapy entries: 20/31 = 65; 11/30; 24/45 = 53;
    11/45. Legible control entries: 23/51 = 45; [illegible]/49 = 47.
Kissin et al. '70; Psychodynam.; 20; 6; Abstinence or nearly so;
    22/[illegible]; [illegible]/44 = 5; 30
Kissin et al. '70; Psychodynam.; 20; 6; Abstinence or nearly so;
    5/33 = 15; 2/41 = 5; 10
Sobell & Sobell '73; Beh. Mod.; 25; 6; Full or part-time employ.;
    21/35 = 60; 14/35 = 40; 20
Sobell & Sobell; Beh. Mod.; [illegible]; 12; Full or part-time employ.;
    21/35 = 60; 16/35 = 46; 14
Levinson & Sereny '69; Eclectic; 30; 12; Slight or much improv.;
    15/26 = 58; 17/27 = 63; -5
Newton & Stein '72; Eclectic; 15; 6; Not readmitted to hosp. for
    alcohol.; 10/15 = 67; 11/16 = 69; -2
Newton & Stein '72; Implosive; 15; 6; Not readmitted to hosp. for
    alcohol.; 7/15 = 47; 11/16 = 69; -22
Ashem & Donner '68; Beh. Mod.; [illegible]; 6; Sobriety; 6/15 = 40;
    0/8 = 0; 40
Storm & Cutler '70; Sys. desen.; 12; 6; Some or marked improv.;
    10/15 = 67; 39/62 = 63; 4

Summary tabulations of a few characteristics of the studies
in Table 2.6 are presented below:

Type of therapy: 11 studies used non-behavioral therapy.
                  9 studies used behavioral therapy.

Distribution of hours of treatment:

    Hours:        1-10   11-20   21-30   31-40   41-50   51-60
    Frequency:      4      9       3       0       2       1

Distribution of follow-up times:

    Months Post Therapy:    0    1-3    4-6    7-9    10-12
    Frequency:              4     0     10      1       5

Of course, interest centers primarily on the outcome measures.
There exist two approaches to summarizing the outcomes: 1) the data
can be pooled across all studies to calculate aggregate "success"
rates, or 2) the "success" rates can be averaged across the 15 studies.
The first method gives a study an importance in the aggregate which
is proportional to its sample size, which could be desirable in some
instances but probably isn't in this instance. The second method
weights each study equally, in effect.

By the first method of aggregation, one finds 651 patients
treated with psychotherapy with 269 reported as "successes" for a
success rate of 41 percent. The comparable figures for the control
condition are 638 cases, 222 "successes" for a "success" rate of
33 percent. The 41 percent vs. 33 percent difference is not very
impressive; but it may not be very fair. Note that a few studies
like Gallant (1971) and McCance and McCance (1969) carry unreasonably
large weight in determining these aggregates because between them
they account for nearly half of all the therapy cases.

Averaging success rates across studies seems preferable.
Doing so yields "success" rates of 51 percent and 33 percent for
psychotherapy and control conditions, respectively. These figures
are probably more defensible than the 41 percent vs. 33 percent
figures. Even so, a "success" rate of 33 percent for untreated
controls is unusual and indicates that the experiments were probably
conducted under favorable circumstances with other than chronic
alcoholics. But even if the 33 percent base-rate figure is unrealistic,
the 18 percent gap between treatment and control groups is
not. One can conclude that on the average 20 hours of psychotherapy
produces 18 "successes" (sobriety 6 months after therapy) out of
every 100 persons treated.

The percentage "success" rates can be transformed into a
metric measure of effect by means of the probit transformation (Glass,
1978). A discrepancy of 51 percent to 33 percent corresponds to a
metric measure of effect of +.46 standard deviation units.
Expression of the effect in this way will permit comparison of the
effects across problem areas such as alcoholism, asthma, and surgery.
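The two ways of aggregating "success" rates, and the probit conversion of a pair of rates into standard deviation units, can be sketched as follows. The function names and the three (successes, cases) pairs are ours, invented only to show how one large study dominates a pooled rate. The probit step is assumed to be the difference between the standard normal deviates that cut off each success rate, which is the usual form of the transformation rather than a formula quoted from the report.

```python
from statistics import NormalDist, mean


def pooled_rate(counts):
    """Method 1: pool successes and cases, so large studies weigh more."""
    return sum(s for s, _ in counts) / sum(n for _, n in counts)


def averaged_rate(counts):
    """Method 2: average the per-study rates, so each study weighs equally."""
    return mean(s / n for s, n in counts)


def probit_effect(p_treated, p_control):
    """Difference of the normal deviates matching two success rates."""
    z = NormalDist().inv_cdf
    return z(p_treated) - z(p_control)


# Hypothetical (successes, cases) pairs: one large study and two small ones.
hypothetical_studies = [(20, 100), (8, 10), (9, 10)]

pooled = pooled_rate(hypothetical_studies)
averaged = averaged_rate(hypothetical_studies)

# The averaged rates reported in the text: 51 percent vs. 33 percent.
effect = probit_effect(0.51, 0.33)

print(f"pooled = {pooled:.2f}, averaged = {averaged:.2f}")
print(f"probit effect = {effect:+.2f} standard deviation units")
```

In the hypothetical data the single large study drags the pooled rate far below the average of the three study rates, which is the kind of imbalance the text attributes to the largest alcoholism studies.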

The relationship of the differential success rate to follow-up
time and amount of therapy was also studied (see figure below).
The difference in percentages of "successes" between treatment and
control groups diminished across follow-up intervals. Immediately
after therapy, there were 37 percent more successes in the therapy
group than the control group; at six months after therapy this
difference dropped to 25 percent; at twelve months it was 3 percent,
i.e., the rate of sobriety is virtually the same in the treatment
and control groups, the treated patients having relapsed. Apparently,
for the benefits of the therapy to be sustained, it must be readministered
at periodic intervals.

Finally, the correlation across the 15 studies between the
number of hours of therapy and the differential "success" rate was
positive and reasonably large: +.49. More therapy was better than
less.

Figure. Decay of treatment effect, alcoholism. Horizontal axis:
follow-up time in months. Solid line connects averages at 0, 6, and
12 months.

School Class-Size and Achievement

The literature on school class-size and its relationship to
achievement has lain about for many years. Some of the first empirical
research in education, that of Joseph M. Rice in the 1890's,
examined the association between class-size and learning. In graduate
school in the 1960's, I was taught that the two were unrelated and there
was little point pursuing the matter. A faint aroma of Chippendale
(unwieldy and antique) still clung to the topic when in 1977 a
friend at the Far West Laboratory, Leonard Cahen, suggested that we
apply to the class-size literature the techniques we had developed for
integrating outcome experiments in psychotherapy. The $5,000 contract
he dangled before us made the problem seem worthwhile.

The literature on class-size and achievement had been reviewed
repeatedly. The reviewers disagreed wildly. One could document this
confusion; it would be simple to quote reviewer X claiming that large
classes are better, reviewer Y that small classes are better, and
reviewer Z that neither is better. But to do so would only embarrass
others and add nothing to one's appreciation of the complexity of the
research. The problems with previous reviews of the class-size
literature are several: (1) literature searches were haphazard and
often overly selective; dissertations were avoided, as a rule, and few
reviewers sought out large archives of pertinent data; (2) reviews were
typically narrative and discursive; the multiplicity of findings could
not be absorbed without quantitative methods of reviewing; (3) reviewers
that attempted quantitative integration of findings made several mistakes:
they used crude classifications of class-sizes; and (4) they took
statistical significance of differences far too seriously.

Our search for class-size- studies was carried out in three places: 
ri) document retrieval and abstracting resources; (2) previous reviews 
of the class-size literature and (3) the bibliographies of studies 
once found. The Ei^ld system ancj Dissertation Abstracts were searched 
completely on the key words "size," *'class size," and "tutoring." 
The dissertation literature was covered *as far back as 1900, and the 
fugitive educational research literature was covered from the mid 
1960*s to 1978. Of the many hundreds of doctoral dissertations scanned 

I)i^ssertation Abstracts , about thirty micro-film copies were purchased. 
A dozen dissertations were eventually incorporated. The -Journal 
literature on class-size was located in the traditional way; one or two 

40 



current reviews of the r-esearth,uere found, the articles cited were 
located, and the articles cited in these articles were located in 
turn. About 300 documents were obtained and read. One hundred-fifty 
of then were found to contain no usable data, i.e., no dat^ whatsoever 
.were reported on the comparison of- staall- and large-class achievement. 
About 70 studies examined the relationship of class-size to non- 
achievement outcomes and classroom process variables. Approximately 
8C studies on the class-size and achievement relationship were included 
in the meta-analysis. 

It is difficult to estimate what portion of the existing literature
was captured by this search. Even though 80 studies exceeded by
50 percent the most extensive reviews published to that time, perhaps
less than half of all studies that exist on the topic were found.
Some studies (credited to school districts) could not be located even
after several phone calls and letters. Other studies were surely
missed because of odd or nondescript titles. Fortunately, the ERIC
system uses key words based on the contents of a paper and not titles
alone. Several studies found in the journal literature by branching
off existing bibliographies had neither "size" nor "class-size" in the
title, evidence enough that several studies were missed because their
titles lacked the key words. Another complication concerns the use
of class-size as an incidental variable in studies focused on other
issues. There are probably many such studies, and only a few of the
most visible ones were located.

The research on class-size and its relationship to achievement
evolved through four stages: the pre-experimental era (1895-1920);
the efficiency era (1920-1940); the large-group technology era (1950-1970);
and the individualization era (1970-present). The boundaries of the
eras are not impenetrable, and even today an atavistic throwback to
the 19th century will appear in a doctoral thesis. At each new stage,
the sophistication of research methodology increased, and the question
of class-size and its effect on achievement was examined with different
motives. One discerns in the narration accompanying the numbers the cult
of efficiency of the early part of this century, the rising birth
rate of the post-war '40's, the advent of teaching technology in the
'60's, and most recently the teacher labor movement and declining
enrollments. What was said about the data changed as new interpretations
served emerging purposes, even when the data changed little themselves.

The meta-analysis was to determine what the available research
revealed about the relationship of class-size to achievement. Drawing
boundaries around this topic was simple compared to the difficulties
encountered in defining psychotherapy, for example (Smith and Glass, 1977).
Conventional definitions of achievement seem scarcely to have changed
over eighty years; and class-size is relatively easily described and
measured.

The quantification of characteristics of studies permitted the
eventual statistical description of how properties of studies affect
the principal findings. Such questions can be addressed as "How
does the class size and achievement relationship vary as a function
of age of pupils?" or "How does it vary between reading and math
instruction?" The first step was to identify those properties of
studies that might interact with the relationship between class-size
and achievement. There is no systematic logical procedure for taking
this step. One simply reads a few studies from the literature of
interest, talks with experts, and then guesses; modifications can
always be made later if needed. About 25 specific items were coded
for each study. Some were more useful than others; several items
were seldom reported in the studies. A coding sheet was devised onto
which the information about each study was transcribed. A single
study might fill several coding sheets, depending on how many different
class sizes were compared, how many different achievement tests were
reported separately for different ages or IQs, and so forth.

The major items of the coding sheet were as follows: (1) year
of publication; (2) publication source (book, thesis, journal);
(3) subject taught (reading, math, etc.); (4) duration of instruction
(number of weeks); (5) number of pupils in the study (different from
class-size since there might be many classes); (6) number of teachers
in the study; (7) pupil ability; (8) pupil ages; (9) types of
experimental control (random assignment, matching, etc.); (10) achievement
measurement (standardized test, ad hoc test, etc.); (11) quantification
of outcomes (gain scores, ANCOVA adjustment, etc.).
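The coding sheet just described can be pictured as one record per comparison. The sketch below is only an illustration of that idea: the field names, the string codes, and the assumption that every pair of class sizes is coded once per achievement test and subgroup are ours, not the layout of the original sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodingSheet:
    """One coded comparison from one study (field names are hypothetical)."""
    year: int                               # (1) year of publication
    source: str                             # (2) "book", "thesis", or "journal"
    subject: str                            # (3) subject taught
    weeks_of_instruction: Optional[float]   # (4) duration of instruction
    n_pupils: Optional[int]                 # (5) pupils in the study
    n_teachers: Optional[int]               # (6) teachers in the study
    pupil_ability: Optional[str]            # (7) often unreported, hence Optional
    pupil_age: Optional[float]              # (8)
    control: str                            # (9) "random", "matched", ...
    measurement: str                        # (10) "standardized test", "ad hoc test"
    quantification: str                     # (11) "gain scores", "ANCOVA", ...
    small_class_size: int
    large_class_size: int


def sheets_per_study(class_sizes, n_tests, n_subgroups=1):
    """Count sheets, assuming each pair of class sizes is coded once
    per achievement test and per separately reported subgroup."""
    k = len(class_sizes)
    pairs = k * (k - 1) // 2
    return pairs * n_tests * n_subgroups
```

Under these assumptions a study that compares three class sizes on two tests fills six sheets, which is why a single study could contribute many comparisons.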

A simple statistic was desired that described the relationship
between class-size and achievement as determined by a study. No matter
how many class-sizes are compared, the data can be reduced to some
number of pairs, a smaller class against a larger class. Certain
differences in the findings must be attended to if the findings are
later to be integrated. The most obvious difference is the scale
properties of the achievement measure. Measurement scales can be
standardized by dividing mean differences in achievement by the within
group standard deviation (a method that is complete and discards no
information at all under the assumption of normal distributions).
The eventual measure of relationship seems straightforward and
unobjectionable:

$$\Delta_{S-L} = \frac{\bar{X}_S - \bar{X}_L}{\hat{\sigma}} \qquad (2.1)$$

where:
  $\bar{X}_S$ is the estimated mean achievement of the smaller class, which
  contains S pupils;
  $\bar{X}_L$ is the estimated mean achievement of the larger class, which
  contains L pupils; and
  $\hat{\sigma}$ is the estimated within-class standard deviation, assumed to
  be homogeneous across the two classes.

As a first approximation to studying the class-size and achievement
relationship, it is considered irrelevant that the particular
types of achievement that lie behind the variable X are quite different
knowledges and skills measured in quite different ways. Reports
of research frequently omit such basic descriptive measures as means
and standard deviations. This omission frequently complicated
the calculation of $\Delta_{S-L}$, but seldom obviated it. Transformations of
commonly reported statistics (t, F, etc.) into $\Delta$'s were derived (Glass,
1978).
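The calculation of the standardized difference, and one way of recovering it when only a t statistic is reported, can be sketched as follows. The transformations actually derived in Glass (1978) are not reproduced in this section; the t conversion shown is the standard identity for two independent groups with a pooled standard deviation, and the function names are ours.

```python
import math


def pooled_sd(sd_small, n_small, sd_large, n_large):
    """Pooled within-group standard deviation of two groups."""
    num = (n_small - 1) * sd_small ** 2 + (n_large - 1) * sd_large ** 2
    return math.sqrt(num / (n_small + n_large - 2))


def delta_from_means(mean_small, mean_large, sd_within):
    """Standardized mean difference, smaller class minus larger class."""
    return (mean_small - mean_large) / sd_within


def delta_from_t(t, n_small, n_large):
    """Recover the standardized difference from an independent-groups t.

    Rearranges t = (mean_small - mean_large) / (sd * sqrt(1/n_small + 1/n_large)).
    """
    return t * math.sqrt(1.0 / n_small + 1.0 / n_large)
```

When a report gives both the descriptive statistics and the t, the two routes should agree, which offers a simple check on the coding.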

In all, 77 different studies were read, coded, and analyzed.
These studies yielded a total of 725 $\Delta$'s. The comparisons are based
on data from a total of nearly 900,000 pupils spanning 70 years of
research in more than a dozen countries. In Table 2.7 appears the
frequency distribution of $\Delta$'s by year in which the study appeared.
It is clear from Table 2.7 that class-size research was an active
early topic in educational research, was largely abandoned for 30
years after 1930, and has been resurrected in the last 15 years.
In Table 2.8, the comparisons are tabulated by the type of assignment
of pupils to the different size classes. Each of the first
three types of assignment represents reasonably good attempts at
eliminating gross inadequacies in design; these three conditions
account for slightly more than half of all the comparisons. Even
though half of the comparisons involved comparing naturally constituted
and non-equivalent large and small classes, some of these were based
on ex post facto statistical adjustments for pre-existing differences.
So the data are not half worthless; indeed, whether the experimental
inadequacies influenced the findings is an empirical question,
rather than an a priori judgment, which was examined in the data
analyses. In Table 2.9 appears the joint distribution of smaller and
larger class-sizes on which the 725 $\Delta$'s are based. For example, six
$\Delta$'s derive from comparisons of group sizes 1 and 3. (The table contains
only 550 entries instead of 725, since comparisons would not be
recorded in this tabulation if S and L were contained within the
same broad category, e.g., if S = 16 and L = 22.)



                             Table 2.7
              Class-Size Comparisons by Year of Study

    Year          No. of $\Delta$'s        %         Cumulative %
    1900-1909            22              3.0%            3.0%
    1910-1919           184             25.4%           28.4%
    1920-1929           138             19.0%           47.4%
    1930-1939            47              6.5%           53.9%
    1940-1949             1              0.0%           53.9%
    1950-1959            62              8.6%           62.5%
    1960-1969           150          [illegible]        83.3%
    1970-1979           121             16.7%          100.0%
    Total               725            100.0%



                             Table 2.8
      Class-Size Comparisons ($\Delta$) by Assignment of Pupils
                  to the Small and Large Classes

    Type of Assignment        No. of $\Delta$'s        %
    Random                          110              15.2%
    Matched                         235              32.4%
    "Repeated Measures"              18               2.5%
    Uncontrolled                    362              49.9%

                             Table 2.9
              Joint Distribution of Smaller and
           Larger Class-sizes in the Comparisons

    [The body of this table is not recoverable from the scan. Rows give
    the smaller class-size and columns the larger class-size; the
    class-size categories legible in the scan are 1, 2, 3, 4-5, 6-10,
    11-16, 17-23, 24-34, and >35.]



The simple statistical properties of the $\Delta$'s were interesting
in themselves, even though their full import required more sophisticated
analysis:

           Properties of Distribution of $\Delta_{S-L}$

    a) n = 725.                          d) Standard deviation = 0.401.
    b) Mean = .088; Median = .050.       e) Range: -1.98 to 2.54.
    c) 40% of the $\Delta$'s were negative;
       60%, positive.

On the average, the 725 $\Delta$'s were positive, i.e., over all
comparisons available, regardless of the class-sizes compared,
the results favored the smaller class by about a tenth of a standard
deviation in achievement. This finding is not too interesting in itself,
since it is an average across many different sizes of classes compared.
Moreover, only 60 percent of the $\Delta$'s were positive, i.e., favored
the smaller class in achievement. This is so even though every
effort was made to find studies spanning the full range of class-sizes
from individual tutorials to huge lectures. One suspects that
the odds of observing a positive $\Delta$ in the class-size range so often
studied (15 to 40, say) were even smaller, perhaps as low as 55 percent
to 45 percent.

In these rough summaries, one of the fundamental problems is
revealed that has made the class-size literature so difficult for
reviewers. If the relationship one seeks has only 55 to 45 odds of
appearing and one looks for it without all the tools of statistical
analysis that can be mustered, the chances of finding it are slight.
One need not wonder why narrative reviews of a dozen or two studies
produced little but confusion.
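The point about 55 to 45 odds can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that a narrative reviewer tallies independent findings and sides with the majority; neither the independence nor the simple vote count is asserted by the studies themselves.

```python
from math import comb


def prob_majority_positive(n_studies, p_positive):
    """Probability that more than half of n independent findings are positive."""
    needed = n_studies // 2 + 1
    return sum(
        comb(n_studies, k) * p_positive ** k * (1 - p_positive) ** (n_studies - k)
        for k in range(needed, n_studies + 1)
    )
```

Calling `prob_majority_positive(12, 0.55)` gives the chance that a dozen such findings show a clear majority in favor of smaller classes; under these assumptions it is only slightly better than an even bet.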

To make sense of the class-size and achievement relationship
one must account for the magnitude of the $\Delta$'s and their variance in
terms of the sizes of the smaller and larger classes. What was needed
was a continuous quantitative model that would relate class-size
C to achievement z. Class-size and achievement might be expected
to be related in something of an exponential or geometric fashion,
reasoning that one pupil with one teacher learns some amount, two
pupils learn less, three pupils learn still less, and so on.
Furthermore, the drop in learning from one to two pupils might be
expected to be larger than the drop from two to three, which in turn
is probably larger than the drop from three to four, and so on. A
logarithmic curve represents one such relationship:

$$z = \alpha - \beta \log_e C + \varepsilon \qquad (2.2)$$

where C denotes class-size. Since $\beta$ could be zero or negative, the
model in (2.2) does not preclude the data showing that class-size
and achievement are unrelated or that larger classes learn more than
smaller ones.

In formula (2.2), $\alpha$ represents the achievement for a "class" of
one person, since $\log_e 1 = 0$, and $\beta$ represents the speed of decrease in
achievement as class-size increases. Formula (2.2) cannot be fitted
to data directly because z is not measured on a common scale across
studies. This problem was circumvented by calculating $\Delta_{S-L}$ for each
comparison of a smaller and a larger class within a study. Then,
from formulas (2.1) and (2.2) one has:

$$\begin{aligned}
\Delta_{S-L} &= (\alpha - \beta \log_e S + \varepsilon_1) - (\alpha - \beta \log_e L + \varepsilon_2) \\
             &= \beta(\log_e L - \log_e S) + \varepsilon_1 - \varepsilon_2 \\
             &= \beta \log_e(L/S) + \varepsilon. \qquad (2.3)
\end{aligned}$$

The model in formula (2.3) was particularly simple and straightforward.
The values of $\Delta_{S-L}$ were merely regressed onto the logarithm
of the ratio of the larger to the smaller class-size, forcing the
least-squares regression line through the origin.

Figure 2.2. Regression lines for the regression of achievement
(expressed in percentile ranks) onto class-size for studies that were
well controlled and poorly controlled in the assignment of pupils.

The least-squares estimate of $\beta$ has the form:

$$\hat{\beta} = \frac{\sum \Delta_{S-L}\,\log_e(L/S)}{\sum [\log_e(L/S)]^2}$$
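The regression through the origin described above reduces to a ratio of two sums, which can be sketched in a few lines. The function name and the input layout (triples of effect, smaller size, larger size) are our own choices for illustration.

```python
import math


def fit_beta(comparisons):
    """Least-squares slope through the origin.

    comparisons: iterable of (delta, small_size, large_size) triples,
    where delta is the standardized difference for that pair of classes.
    """
    num = 0.0
    den = 0.0
    for delta, small, large in comparisons:
        x = math.log(large / small)   # log of the class-size ratio L/S
        num += delta * x
        den += x * x
    if den == 0.0:
        raise ValueError("need at least one comparison with L != S")
    return num / den
```

Because no intercept is estimated, a comparison of two classes of equal size contributes nothing to the fit, as the model requires.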

The model in formula (2.2) was fitted to the data base as a whole
and to many subdivisions of it. The strength of the relationship between
class-size and achievement did not vary with characteristics of the studies
(e.g., age of pupils, ability, subject taught) with one exception. The
relationship was much stronger for studies in which pupils were randomly
assigned to the classes of different sizes than for studies that used matched
or uncontrolled assignment; thus, better controlled studies gave more positive
results. Hence, we restricted our estimation of the relationship to the
100 or so $\Delta$'s that arose from the well-controlled experiments. After fitting
the model in formula (2.2) to the data, estimating $\beta$ and transforming z
to a percentile scale, the relationships in Figure 2.2 emerged. Assuming
arbitrarily that the average pupil in a class of 40 scores at the 50th
percentile in achievement, his improvement in achievement as class-size is
reduced is indicated by the upper curve in the figure. When he is taught
in a class of 15 his achievement rises to the 60th percentile; in a group
of 10, he will score at the 65th percentile; and taught by himself (class-size
equal 1), he is expected to score above the 80th percentile. We
concluded our report with these words:

    A clear and strong relationship between class-size and achievement
    has emerged. The relationship is seen most clearly in
    well-controlled studies in which pupils were randomly assigned to
    classes of different sizes. Taking all findings of this
    meta-analysis into account, it is safe to say that between
    class-sizes of 40 pupils and one pupil lie more than 30 percentile
    ranks of achievement. The difference in achievement resulting
    from instruction in groups of 20 pupils and groups of 10 can
    be larger than 10 percentile ranks in the central regions
    of the distribution. There is little doubt that, other things
    equal, more is learned in smaller classes.

                                     (Glass and Smith, 1979, p. 15)
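The translation from the logarithmic model to percentile ranks can be sketched as follows. The sketch assumes normally distributed achievement and anchors a pupil at the 50th percentile in a class of 40, as in the text; the slope passed in is a hypothetical value chosen for illustration, since the fitted slope is not reported in this section.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_percentile(class_size, beta, reference_size=40):
    """Percentile of a pupil who would score at the 50th percentile
    in a class of reference_size, when taught in a class of class_size."""
    gain = beta * math.log(reference_size / class_size)  # in SD units
    return 100.0 * normal_cdf(gain)


# beta = 0.25 is an illustrative assumption, not the reported estimate.
for size in (40, 20, 15, 10, 1):
    print(size, round(expected_percentile(size, beta=0.25), 1))
```

The curve flattens toward large classes and steepens toward small ones, which is the shape the quoted passage describes.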

The impact of our findings was immediate. At first the word
of the findings spread informally, through face-to-face contact.
A friend mentioned the study during an interview on an entirely
different subject with the foreign education writer for the "London
Times." An article followed, then several others as one thing led to
another. The process that ensued at that point more resembled Brownian
movement than linear, hierarchical dissemination. In a span of a year,
synopses of the findings appeared in magazines ("Today's Education,"
"Psychology Today," "Forum"), newspapers ("New York Times," "Denver
Post," "London Times," AP wire service), and were discussed in radio
and television interviews that must have reached millions of people.
The phone began to ring with questions and requests for documents. Parents,
teachers, administrators, politicians (Pennsylvania, Georgia, Nevada,
Colorado, North Carolina and Minnesota): they either read about the
study in the popular press or heard of it from an acquaintance. Teachers
unions waved the report under the noses of boards and administrators;
the latter criticized it as inaccurate or hired critics to discredit it.

Sex Bias in Counseling and Psychotherapy 

Smith (1980) found 34 studies of possible bias of counselors
and psychotherapists toward male vs. female clients. A typical study
examined experimentally the possibility that counselors and therapists
varied their diagnoses, recommendations and attitudes toward their
client depending on the client's gender. The 34 studies contained 50
assessments of possible sex bias.

There was wide variation in the designs used, their adequacy,
and the extent to which "client" individual differences were considered.
However, each study was used in the meta-analysis regardless of its
qualities. Thus, the author's theoretical and methodological biases
had minimal influence. The studies were rated for design quality so
that the magnitude of sex bias produced by studies of different levels of
design quality could be ascertained. A score of 3 was given for studies
in which all experimental variables were controlled and the effects of
client gender could be distinguished from the effects of other client
characteristics. A score of 2 was given to studies that merely
had experimental variables under control. A score of 1 was assigned
to studies in which experimental variables were uncontrolled or seriously
confounded.

Methods for transforming the analytic results of the studies into
a common metric followed Glass's (1978) specifications. Each dependent
variable from the studies was converted into an "effect of sex bias"
(ESB) according to the following formula:

$$\mathrm{ESB} = \frac{M_{\mathrm{Male}} - M_{\mathrm{Female}}}{\sigma}$$

In a study of the effect of client gender on therapist judgment
of client prognosis, for example, the mean for the prognosis given to
females was subtracted from the mean prognosis given to males. The
difference was divided by their average standard deviation. The resulting
ESB is in the form of a normal unit deviate. An ESB of 1 indicates that
the mean of the males on the dependent variable is 1 SD higher than
the mean of the females on that variable. In the above
example, an ESB of 1 would indicate that counselors gave males a much
more favorable prognosis than that given to females; in fact, the average
male prognosis is more favorable than the prognosis for 84 percent of the
females, assuming a normal distribution of the sex bias variable.

The ESB is standardized so that different measures can be
viewed on a common, convenient metric and combined with others to
form an overall picture of the sex bias effect. The dependent measures
were arranged so that a positive ESB always meant bias against females
or against nontraditional, nonconformist, or androgynous actions,
decisions, or labels. A negative ESB indicated bias in favor of females
or nonconforming, nonstereotypic goals. One study illustrates this
process. Price and Borgers (1977) compared counselors' ratings of
appropriateness of course selection for boys and girls. The mean
appropriateness rating given to boys was 3.5. The mean appropriateness
rating given to girls was 3.45. The average standard deviation for
boys and girls was .95. The ESB was .05. That is, the rated
appropriateness was biased against females by a magnitude of .05
standard deviation units, a very small amount.
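The arithmetic of the effect-of-sex-bias measure can be checked with a short sketch. It uses the Price and Borgers figures as we read them from this copy (boys 3.5, girls 3.45, average standard deviation .95); the function names are ours, and the percentile reading assumes a normal distribution, as the text does.

```python
import math


def esb(mean_male, mean_female, sd_male, sd_female):
    """Effect of sex bias: male minus female mean over the average SD."""
    average_sd = (sd_male + sd_female) / 2.0
    return (mean_male - mean_female) / average_sd


def percent_of_females_below_male_mean(esb_value):
    """Share of the female distribution below the male mean, assuming normality."""
    return 100.0 * 0.5 * (1.0 + math.erf(esb_value / math.sqrt(2.0)))


# Readings of the Price and Borgers (1977) values are an assumption of this sketch.
course_selection = esb(3.5, 3.45, 0.95, 0.95)
```

An ESB of 1 passed to `percent_of_females_below_male_mean` corresponds to the 84 percent figure quoted earlier for a one standard deviation difference.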

Transformation of dependent measures into ESBs was straightforward
when means and standard deviations were given. When t, F,
or chi-square statistics were given, estimates of $\sigma$ were found by
backward solution of statistical formulas. For example, an estimate
of $\sigma$ can be found from a study in which only a value for F, the n's,
and the treatment means are reported by using the following steps:

[The equations for these steps are not legible in the source.]

where J is the number of groups and n is the number of cases per group.
More complicated procedures permitted the estimation of $\sigma$ from designs
with blocking variables and covariates, as specified by Glass (1978) and
elaborated in McGaw and Glass (1980). Special problems arose in
the calculation of ESB when the researcher reported only significance
levels of effects; when, for example, the researcher stated that client
sex produced no significant differences on the dependent variable. In
this case, an ESB of zero was entered for that variable.*
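One way such a backward solution can proceed is sketched below. It is the standard identity for a one-way analysis of variance with equal group sizes (F is the ratio of the between-groups to the within-groups mean square), offered as an assumption about the kind of steps involved rather than as a transcription of the original equations.

```python
import math


def sigma_from_f(f_value, group_means, n_per_group):
    """Estimate the within-group SD from F, the group means, and n per group.

    Assumes a one-way design with equal n, so that
    MS_between = n * sum((mean_j - grand_mean)^2) / (J - 1)
    and MS_within = MS_between / F.
    """
    j = len(group_means)
    grand_mean = sum(group_means) / j
    ms_between = n_per_group * sum((m - grand_mean) ** 2 for m in group_means) / (j - 1)
    ms_within = ms_between / f_value
    return math.sqrt(ms_within)
```

With the standard deviation recovered in this way, the effect is computed from the reported means exactly as when the standard deviation is given directly.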

Another problem was encountered in studies that reported
item-by-item significance tests on sex-role stereotyping measures. The
item-level data were converted to ESBs, and the average ESB for the item
set was recorded for that study. Except for these few studies in
which multiple item-level data were averaged, the practice was to
record an ESB for each dependent measure that the researcher reported.**
Table 2.10 contains the ESBs calculated for the studies.

The ESB measures were accumulated by the domain under investigation
(counseling or psychotherapy) and by the construct measured
(attitudes, judgments, or behaviors) and for other variables of interest.



* A check on this procedure was conducted after the meta-analysis
was completed. Neither altering the procedure nor eliminating these
findings from the summary changed the final ESB by more than a fraction.

** A later check on the effect of ESB calculated at the level of
the dependent measure and at the level of the study showed no differences
in the magnitude of effect.



The resulting summary statistics are contained in Table 2.11. The means,
standard deviations, and the number of effects are presented, along with
the standard error of effects. Whether the set of studies
in a meta-analysis should be considered the entire population of studies
on a topic or rather a sample of a hypothetical population of such
studies is problematic. If the latter is true, then inferential statistics
might be appropriately applied to the effect-size measures. However,
appropriate sampling distributions for inferential statistics in
meta-analysis have yet to be evaluated. Presentation of the standard error
of effects allows the reader a rough-and-ready measure of the significance
of difference of the means of two contrasting conditions (e.g., ESB for
well-controlled studies vs. ESB for poorly controlled studies).
A difference in means less than two standard errors in magnitude was
deemed unreliable and did not figure into the discussion of results.
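The rough-and-ready rule can be sketched as follows. The text does not say which standard error the difference is compared against; the sketch assumes the standard error of a difference between two independent means, and that choice, like the function names, is ours.

```python
import math


def standard_error(sd, n_effects):
    """Standard error of a mean effect from its SD and number of effects."""
    return sd / math.sqrt(n_effects)


def reliably_different(mean_a, se_a, mean_b, se_b, k=2.0):
    """True if two mean effects differ by at least k standard errors.

    Assumption: the yardstick is the standard error of the difference
    of two independent means, sqrt(se_a^2 + se_b^2).
    """
    se_difference = math.hypot(se_a, se_b)
    return abs(mean_a - mean_b) >= k * se_difference
```

A stricter or looser yardstick is obtained by changing `k` or by substituting a different standard error for the difference.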

Table 2.11 contains the summary statistics for the sex-bias
meta-analysis. The overall mean of ESBs is given along with the mean for
each construct, domain, the source of the study, and the validity of
the design.

The results are clear. There is no evidence for the existence
of counselor sex bias when the research results are taken as a whole.
The average ESB is -.04, indicating that the counselor bias is near
zero or even slightly in favor of women and nonstereotyped actions for
women. The size of the sex-bias effect does not change from construct to
construct. Attitudes, judgments, and behaviors all show about the same
size of effect. Considered separately, the findings labeled clinical
stereotypes produced an ESB of .10, which recapitulates the conventional
wisdom that clinicians hold negative stereotypes about women. When
the standard error of effects is used to evaluate this, one finds that
the ESB for stereotypes is not reliably different from the ESB of the
data as a whole.

The analysis of sex bias found in journals as opposed to
dissertations is extremely interesting. Journal articles were much
more likely to show bias against women. Dissertations showed the
opposite. One is tempted to suppose that dissertations are more poorly
designed and executed and therefore less likely to be published. That
supposition would be incorrect, as the average rating of design quality
was slightly higher for dissertations than for journals (2.57 and 2.16,
respectively). The best designed studies (those in which experimental
variables were well controlled and provision was made to isolate gender
effects from personal characteristics) yielded results opposite to
those of the sex-bias hypothesis. Studies with moderate validity
(controlled variables but no provision for gender and case distinctions)
averaged zero on the ESB variable. Studies with poor controls or
severe confounding of variables yield the results most supportive of
the sex-bias hypothesis.

Analysis of interactions of variables failed to yield reliable
results, with one exception. There was a statistically significant
interaction between design quality and publication status, but not in
the predictable direction. Table 2.12 contains the ESB and standard error
of ESB for the Design Quality X Source of Publication interaction.
Studies published in journals were more likely to show the effect of
sex bias, regardless of the quality of their research design. Viewed
another way, studies most likely to be submitted or accepted for
publication tended to be those that demonstrated the sex-bias effect, their
design quality notwithstanding.

                             Table 2.10
      Author, Source, Domain, Construct, Type of Effect, and
            Effect of Sex Bias (ESB) of Studies

    [The body of this table is not recoverable from the scan. Its
    columns are: Author; Source (Dissertation or Journal); Domain
    (Psychotherapy or Counseling); Construct (Attitude, Judgment, or
    Behavior); Validity; Type of effect; ESB.]
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Drug Therapy for Psychological Disorders 

Miller (1978; also see Smith, Glass and Miller, 1980) sought
to integrate a fragmented and widely scattered empirical literature on
the effects of drug therapy on persons with debilitating psychological
disorders. A conventional wisdom had long pervaded the field and it
both reflected and supported the political equilibrium that psychiatrists
and psychologists had struck. Ask most mental health practitioners and
they would have told you that verbal psychotherapy practiced by itself
on the seriously disturbed (schizophrenic, psychotic) is a waste of
time, but combine it with drug treatment (which is effective in isolation)
and the synergistic combination is much more beneficial than the sum of
their separate contributions. Psychologists who believed this would
serve at the pleasure of psychiatrists, who are empowered by law to
prescribe pharmaceuticals.

Miller found several thousand experimental studies that bore
on the question of the relative efficacy of drug and psychotherapy
effects. Most of these were clinical trials comparing drugs against
placebos. From this huge literature, Miller sampled at random about
fifty studies. The remainder of the literature comprised about 125
experiments that compared drugs and psychotherapy in various odd
combinations (e.g., drug-plus-psychotherapy vs. drug vs. psychotherapy;
drug-plus-psychotherapy vs. placebo).

Miller calculated the standardized average difference on the
dependent variable for each of the outcomes measured in the experimental
comparisons in the studies. Nearly 550 effects were thus calculated.
Summaries of the averages appear in Table 2.13. There one sees, for
example, that in 55 comparisons of verbal psychotherapy with an untreated
control group or placebo, the psychotherapy group averaged .30 standard
deviation units higher on the outcome measure. In 94 comparisons of
drug-plus-psychotherapy with psychotherapy alone, the former averaged
[value illegible] standard deviation units higher than the latter on the
dependent variables measured in the experiments. Table 2.13 gives a
parametric structure for the comparisons with numeric parameters to be
estimated from the data. Such quantification is required of what are
essentially quantitative questions about separate and interactive effects
of drugs and psychotherapy. Narrative and box-score summaries are quite
at a loss to cope with such problems.

Consider now the problem of combining data in Table 2.13 to obtain
estimates of the parameters. That the drug-plus-psychotherapy vs.
drug comparison, which estimates $\psi + \eta$, is a full one-tenth standard
deviation larger than the .30 estimate of $\psi$ from the first line of the
table might lead one to believe that $\eta$ is positive; but the comparison
of the estimates of $\delta + \eta$ and $\delta$ (being .44 and .51, respectively)
reverses this impression. Parameter estimation by inspection in this
way is too arbitrary and confusing. Several comparisons in the table
contain information about the same parameters; it seems reasonable that
every source of information about a parameter should be used in estimating
it. A complete and standard method of combining the data in Table 2.13
into estimates of the parameters is needed. Such a method is suggested
when one recognizes that the two middle columns of Table 2.13
constitute a system of linear equations, three of them independent
and containing three unknowns ($\psi$, $\delta$ and $\eta$). The method of least squares
statistical estimation can be applied to obtain estimates of the separate
and interactive effects of drug and psychotherapy. The estimates
obtained by application of least-squares methodology to the data in
Table 2.13 are as follows:

    $\hat{\psi}$, the separate effect of psychotherapy = .31
    $\hat{\delta}$, the separate effect of drug therapy = .42
    $\hat{\eta}$, the interactive effect of drug-plus-psychotherapy
        = [value illegible in the source]

Each effect is expressed on a scale of standard deviation units.
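The least-squares idea can be illustrated with the four averages quoted in the text. The sketch below is only a demonstration of the method: Table 2.13 is not reproduced here, it may contain further comparisons and weights, and so the output of this unweighted fit is not expected to match the published estimates. The symbol names psi, delta and eta, the design rows, and the helper functions are our own.

```python
def solve_linear(a, b):
    """Solve a square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares(design, y, weights=None):
    """Weighted least squares via the normal equations (X'WX) b = X'Wy."""
    if weights is None:
        weights = [1.0] * len(y)
    p = len(design[0])
    xtx = [[sum(w * row[i] * row[j] for row, w in zip(design, weights))
            for j in range(p)] for i in range(p)]
    xty = [sum(w * row[i] * yi for row, yi, w in zip(design, y, weights))
           for i in range(p)]
    return solve_linear(xtx, xty)


# Columns: psi (psychotherapy), delta (drug), eta (interaction).
design = [
    [1, 0, 0],  # comparison estimating psi
    [1, 0, 1],  # comparison estimating psi + eta
    [0, 1, 1],  # comparison estimating delta + eta
    [0, 1, 0],  # comparison estimating delta
]
averages = [0.30, 0.40, 0.44, 0.51]  # values quoted in the text
psi, delta, eta = least_squares(design, averages)
```

Even in this reduced form the fit resolves the conflicting impressions about the interaction into a single small estimate. Passing the number of comparisons behind each average as `weights` would be a natural refinement, though whether the original analysis weighted the rows is not stated here.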



Thus, the data of Table 2.13 lead to the conclusion that with the groups
of clients studied psychotherapy produces outcomes that are about
one-third standard deviation superior to the outcomes from placebo or
untreated control groups. The drug effect is only about a third
greater than the psychotherapy effect. An effect of .31 will move
an average client from the middle of the control group distribution to
about the 62nd percentile; an effect of .42 would move the average client
to only about the 66th percentile. The effects of the two therapies were
[text missing in the source]
conducted for only half the time it took to conduct the psychotherapies
(2.6 months vs. 6.1 months). Any careful assessment of the relative
value of drug and psychotherapy will take both effects and costs into
account.

Arguments over the relative value of drug and psychotherapy
will be simpler for the fact that the interactive effect of combining
the two therapies is virtually zero. This must not be misunderstood
as implying that drug-plus-psychotherapy is ineffective;
far from it. The near zero interaction effect means that when drug
and psychotherapy are combined, one can expect benefits equal to the
sum of the separate drug and psychotherapy effects (.31 + .42 = .73),
not more or less.
FINDING STUDIES

Reviewing and integrating a research literature begins, obviously enough, with the literature, often a widely scattered, variegated landscape of articles, theses, project reports and whatever. Jackson (1975) showed how this first step was occasionally taken rather uncertainly by reviewers. Of 36 reviews that Jackson analyzed, only one reported having searched the literature with the help of indexes like Psychological Abstracts or Dissertation Abstracts; only three of the 36 reviews reported searching bibliographies of previous reviews of the topic. Whether reviewers do not take such obvious steps in finding studies or take them but neglect to say so may be immaterial from the reader's point of view; in either case it is difficult to judge whether the studies being reviewed represent most of the existing evidence on the question or only an unrepresentative portion. Earlier we likened meta-analysis to survey research; thus, finding studies is comparable in importance to sampling frames and methods in survey design and analysis. Locating studies is the stage at which the most serious form of bias enters a meta-analysis, since it is a potential bias whose impact is difficult to assess. The best protection against inestimable sources of bias is a thorough description of the procedures used to locate the studies that were found so that the reader can make an intelligent assessment of the representativeness and completeness of the data base for a meta-analysis.
As an example of the lengths to which one might sometimes have to go to feel confident of having done a thorough job of finding relevant studies, consider Miller's (1976) experiences in reviewing an enormous literature on the psychological effects of drug therapy:

"To draw conclusions about the entire realm of clinical drug research on psychological disorders, a sample was taken from the large number of existing drug therapy studies. An attempt was made to draw a representative sample of all published clinical drug trials on mentally ill humans reported in the English language literature between 1954 and 1977.

The only design requirement for inclusion in the sample was that studies employ a no-drug treatment or a placebo control group. Though previous reviewers were admonished for inclusion requirements that were, in this author's opinion, too restrictive (e.g., including only double-blind placebo controlled studies), this somewhat arbitrary line was drawn because of a conviction that without a control group, spontaneous symptom remission rampant in psychiatry would be recorded as a drug effect. Case studies, experiential reports, pre-post designs, and drug versus drug studies were therefore omitted.

To identify more clearly the domain from which to sample, further restrictions were imposed on selection of potential studies. Studies of patients whose primary diagnosis was somatic were excluded. Thus omitted were studies of drugs used to treat patients for organic brain syndrome, epilepsy, phenylketonuria, minimal brain damage, or Down's Syndrome, and studies of patients with psychophysiological disorders (asthma, backache, acne, ulcer, enuresis, angina, etc.). This criterion did not exclude studies whose primary focus was examination of neurotic or psychotic patients or patients with character disorders whose somatization of symptoms led to physiological illness.

All studies of normal subjects and all studies that used only physiological outcomes (e.g., blood plasma levels of amines, EEG's, urinalysis) were omitted. Lastly, studies of toxic psychosis (e.g., drug induced psychosis) or model psychosis (e.g., using hallucinogens) were not examined.

A Medical Literature and Retrieval System (MEDLARS) search from the University of Colorado Medical Center computer search facility generated all research meeting specified criteria catalogued between January 7, 1966 and January 30, 1977. (The search specifications appear in Table 3.1.) The facility catalogues all studies from approximately 2,402 journals.

Studies could not be suppressed by design characteristics or outcome variables so though all listed studies met the inclusion requirements there was an unspecified number of studies listed that met the exclusion requirements as well (e.g., there were some uncontrolled studies and studies designed to assess only bio-chemical outcomes of drug administration). Approximately 1,100 studies were located by the MEDLARS search.

Several studies were selected at random from the MEDLARS print-outs. As the referenced articles were located and read, it became clear that many studies lacked control groups. Titles containing no allusion to the existence of a control group (via such key words as "double-blind," "crossover," "controlled," or "placebo") portended studies lacking this crucial ingredient. Therefore, to reduce reference retrieval time by directing gathering efforts toward studies very likely to have control groups, articles with titles containing the abovementioned key words became the primary focus of the random sample. Forty such studies were randomly chosen from the MEDLARS bibliography.

From the psychopharmacological literature prior to January 1, 1966, the period not covered by MEDLARS, a random sample of about fifty studies was taken from bibliographies of comprehensive review articles on the efficacy of drug treatment in psychiatric cases and from studies listed in Psychological Abstracts between 1954 and 1966 under the heading Therapy/Drugs. These review articles and the number of bibliographical references made in each are presented in Table 3.2. Shown as the last reference in Table 3.2 is the number of studies sampled from the 1954-1966 Psychological Abstracts that became part of the pool of pre-1966 references from which studies were sampled.

Once again the emphasis on title terminology that was likely to indicate the use of a control group was applied to selection of studies from these bibliographies.

The selection of the ninety or so articles (fifty articles from the 1954 to 1966 literature; forty articles from the 1967 to 1977 literature) was stratified so that approximately equal numbers would be represented in three major drug categories: antipsychotic, anti-anxiety, and antidepressant. Once these articles were assembled a few articles were added to assure that major well-known studies and very recently published articles (February and March, 1977) were not overlooked. Ninety-six articles or books studying the effects of drug therapy were thus collected, read and coded."

(Miller, 1976, pp. 31-36)
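
The two mechanical steps in Miller's account, screening titles for control-group key words and then drawing a random sample stratified by drug category, can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function names and the example titles are hypothetical, and the quoted passage does not say how the original sampling was actually carried out.

```python
import random
import re

# Key words quoted in the passage; whole-word matching keeps
# "uncontrolled" from counting as "controlled".
CONTROL_PATTERN = re.compile(
    r"\b(double-blind|crossover|controlled|placebo)\b", re.IGNORECASE
)


def title_suggests_control_group(title: str) -> bool:
    """True if the title contains one of the control-group key words."""
    return CONTROL_PATTERN.search(title) is not None


def stratified_sample(records, per_stratum, seed=0):
    """Draw up to `per_stratum` screened records at random from each drug category."""
    rng = random.Random(seed)  # fixed seed so the draw can be documented and repeated
    by_category = {}
    for record in records:
        if title_suggests_control_group(record["title"]):
            by_category.setdefault(record["category"], []).append(record)
    sample = []
    for category in sorted(by_category):
        pool = by_category[category]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


# Hypothetical bibliography entries, for illustration only.
example_records = [
    {"title": "A double-blind trial of drug A", "category": "antipsychotic"},
    {"title": "Placebo-controlled study of drug B", "category": "antipsychotic"},
    {"title": "An uncontrolled case series of drug C", "category": "anti-anxiety"},
    {"title": "Crossover comparison of drug D and placebo", "category": "anti-anxiety"},
    {"title": "Clinical notes on drug E", "category": "antidepressant"},
    {"title": "Controlled evaluation of drug F", "category": "antidepressant"},
]

for record in stratified_sample(example_records, per_stratum=1):
    print(record["category"], "-", record["title"])
```

Recording the seed and the screening pattern alongside the sample is one way of documenting a search in the detail the passage argues for.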
[Table 3.2, surviving portion. Entries: (b) Minor Tranquilizer; (c) Anti-Psychotic; Klerman and Cole ('65), Anti-Depression; Morris and Beck ('74), Anti-Depression; Sheard, M. ('75), Anti-Aggression; Psychological Abstracts, Therapy/Drugs. Numbers of references, in source order: 341; 165; 60; 25; 2,953.]

Miller's example has been reported here in rather more detail than may seem polite to the reader to make a point. Documenting the methods used in finding research literature takes more space than custom traditionally allocates to describing one's search. How one searches determines what one finds; and what one finds is the basis of the conclusions of one's integration of studies. Searches should be more carefully done and documented than is customary.

The Landscape of Literature

Scholarly, empirical literature in the social sciences and applied fields can be found in either primary or secondary sources. By primary sources is meant the archival periodical literature, "the journals," hundreds, perhaps thousands, of them from all over the world. Dissertations and theses are also regarded as primary sources, as well as "fugitive" literatures of government reports, papers from scholarly meetings, reports to foundations, public agencies and the like.

Secondary sources cite, review and organize the material of the primary sources; they include review periodicals (e.g., Psychological Bulletin, Review of Educational Research, Sociological Review), periodical reviews (Encyclopedia of the Social Sciences, Encyclopedia of Educational Research), and various abstract and citation archives:

     Abstracts in Anthropology
     Child Development Abstracts and Bibliography
     Current Index to Journals in Education
     Dissertation Abstracts International
     Education Index
     Government Reports Announcements & Index
     Index Medicus
     Index of Economic Articles
     Interagency Panel Information System
     International Bibliography of Economics
     International Bibliography of Political Science
     International Political Science Abstracts
     Journal of Economic Literature
     Library of Congress Catalog
     National Clearinghouse for Mental Health Information
     National Institute for Mental Health Grants and Contracts Information System
     National Technical Information Service
     Psychological Abstracts
     Research in Education
     Smithsonian Science Information Exchange
     Sociological Abstracts

Some systems are computerized and quite sophisticated. For example, the Educational Resources Information Center operated by the National Institute of Education is a remarkable service that not only indexes and abstracts the published literature in education (see Current Index to Journals in Education) but the fugitive literature as well (see Resources in Education). More significantly, ERIC is a system organized around a thesaurus of topic descriptors assigned by experienced staffs of readers of the documents; this feature represents a significant advance over indexes that depend on author selected descriptors or the key words of titles.

Perhaps we have said enough at this level. The reader who has gotten this far is unlikely to be a stranger to modern libraries and the delights that they hold. And the technology of information storage and retrieval is advancing so rapidly that whatever detail we might give here is likely soon to be out of date.

Literature Searches in Meta-analyses

Our topic is the methodology of meta-analysis, so in the remainder of this chapter we shall limit ourselves to a couple of considerations about literature searching that bear directly on meta-analysis.



Reliability of Literature Searches

No matter how ambitious and sophisticated are one's efforts to find all empirical research on a topic, the aspiration to find everything must be inevitably frustrated. There is simply too much literature in too many strange places to find it all. But reviewers can do a better job than they typically have done. The arbitrary exclusion of vast amounts of literature (e.g., excluding all dissertations or all "fugitive manuscripts" in ERIC) is unsound and bespeaks more faintness of heart than intelligence of judgment. Nevertheless, the most conscientious efforts fall short of perfect. There is less reliability in searching for research studies than would be tolerable in survey research, for example; but it is an especially intransigent sort of unreliability for which we have no facile answers.



We tested the reliability of four large study indexes by computerized search on descriptors for "group homes for delinquents." The four indexes were ERIC (Educational Resources Information Center), Psychological Abstracts, Dissertation Abstracts, and Council for Exceptional Children Abstracts. A total of 27 different studies were found in all. But they were distributed according to the following cross-classification.




[Table: Numbers of Listings from Different Data Bases.
 Search on: (Achievement (v) Place) and (Teaching Family) (group homes for delinquents).
 Rows and columns: ERIC, Psychological Abstracts, Dissertation Abstracts, CEC Abstracts; final row: Unique.
 Legible entries, in source order: ERIC row: 8, 2; Psychological Abstracts row: 2, 22, 2; CEC Abstracts row: 9.]



For example, of 8 studies on the topic found in the ERIC system, two were also listed in Psychological Abstracts, and three also appeared in the CEC Abstracts. Five of the 8 ERIC studies did not appear in any of the other three indexes. The greatest proportion of redundancy appears to be between Psychological Abstracts and CEC Abstracts on this topic. The above table gives one pause. Perhaps the social and behavioral sciences need indexes of indexes!
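
A cross-classification of this kind is easy to compute once each index's listings are held as a set of study identifiers. The sketch below is an illustration only: the identifiers are invented, chosen merely so that the ERIC figures match those quoted in the text (8 listings, 2 shared with Psychological Abstracts, 3 shared with CEC Abstracts, 5 found nowhere else); they are not the data behind the original table.

```python
def cross_classify(listings):
    """Return pairwise overlap counts, per-index unique counts and the
    number of distinct studies across all indexes."""
    names = list(listings)
    overlap = {
        a: {b: len(listings[a] & listings[b]) for b in names} for a in names
    }
    unique = {}
    for a in names:
        elsewhere = set()
        for b in names:
            if b != a:
                elsewhere |= listings[b]
        unique[a] = len(listings[a] - elsewhere)
    total_distinct = len(set().union(*listings.values()))
    return overlap, unique, total_distinct


# Hypothetical study identifiers returned by each index for one search.
example_listings = {
    "ERIC": {1, 2, 3, 4, 5, 6, 7, 8},
    "Psychological Abstracts": {7, 8, 9, 10},
    "Dissertation Abstracts": {11},
    "CEC Abstracts": {6, 7, 8, 12},
}

overlap, unique, total_distinct = cross_classify(example_listings)
print(overlap["ERIC"]["Psychological Abstracts"])  # 2
print(overlap["ERIC"]["CEC Abstracts"])            # 3
print(unique["ERIC"])                              # 5
print(total_distinct)                              # 12
```

The diagonal of `overlap` gives each index's own total, so the same structure reports both how much each index found and how little the indexes agree.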

Publication Bias and Meta-analysis

Meta-analyses may be thought of as a type of survey research. The goal of the meta-analyst should be to provide an accurate, impartial, quantitative description of the findings in a population of studies on a particular topic. This may be done by exhausting the population or sampling representatively from it. No survey would be considered valid if a sizable subset (or stratum) of the population was not represented in the cumulative results. Neither should a meta-analysis be considered complete if a subset of its population is omitted. One very important subset of evidence is the subset of unpublished studies. To omit dissertations and fugitive research is to assume that the direction and magnitude of effect is the same in published and unpublished works.

The most radical criticism of the assumption of equivalence is the old saw that the published literature only represents the five percent of false positives in a population of studies wherein the null hypothesis is true. That is, the published stratum and the unpublished stratum have opposite average effects, and a meta-analysis containing only published studies would be wholly unrepresentative of the population. Rosenthal (1979) effectively countered this attack by a mathematical demonstration of the numbers of studies which would have had to be languishing in file drawers to make up the 95 percent null results. The existence of such huge numbers is considered implausible.
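
Rosenthal's argument can be illustrated numerically. The sketch below uses the commonly cited form of his fail-safe calculation, in which the combined result of k studies is Z = (sum of Z_i) / sqrt(k + X) and X is the number of unretrieved studies averaging a null result that would pull the combined Z down to the one-tailed .05 criterion of 1.645. The report does not print the formula, so treat this as an assumption about what is meant; the Z values are hypothetical.

```python
def fail_safe_n(z_scores, z_crit=1.645):
    """Number of filed-away null-result studies needed to bring the
    combined (summed-Z) result down to the critical value z_crit."""
    k = len(z_scores)
    total_z = sum(z_scores)
    # Solve total_z / sqrt(k + x) = z_crit for x.
    return (total_z ** 2) / (z_crit ** 2) - k


# Ten hypothetical published studies, each with Z = 2.0.
print(round(fail_safe_n([2.0] * 10), 1))  # about 137.8
```

Under these assumptions roughly 138 unseen null studies would be needed to offset ten modestly significant published ones, which is the sense in which the required file drawers become implausibly large.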

The results of meta-analyses which did represent both published and unpublished literature provide further evidence on the assumption of equivalence. Table 3.3 contains the results of 12 such meta-analyses. In every one of the ten instances in which the comparison can be made the average experimental effect from studies published in journals is larger than the corresponding effect estimated from theses and dissertations. That is, if one integrates only "published" (meaning journal published) studies, the impression of support for the favored hypothesis is artificially enhanced over what would be seen if the entire
Table 3.3

Relationship Between Source of Publication and Findings
in 12 Meta-Analyses of Experimental Literatures

                                                                 Source of Publication
Investigator(s)        Topic                                Journal    Book   Thesis  Unpubl.

Kavale ('79)           Psycholinguistic              n:        13               16       5
                                                     ES:      .50              .30     .37

?                      ?                             n:         ?               13      34
                                                     ES:      .35              .2?     .54

?                      Tutoring                      n:         ?                ?      17
                                                     ES:      .77                ?       ?

Rosenthal ('76)        Experimenter                  n:         ?               50
                                                     ES:     1.02              .74

Smith ('80a)           Sex bias in                   n:        28               32
                       psychotherapy                 ES:      .22             -.24

Smith ('80b)           Effects of aesthetics         n:        29              154      56
                       educ. on basic skills         ES:     1.08              .48     .50

Carlberg ('79)         Spec. ed. room placement      n:       146      17       45      14
                       vs. reg. room placement       ES:     -.09    -.01     -.1?    -.14

                       Resource room plac.           n:        33       6        ?       ?
                       vs. reg. room plac.           ES:     -.11       ?        ?       ?

Miller ('79)           Drug therapy of               n:       336               21
                       psych. disorders              ES:        ?              .00

Hearold ('79)          Effect of TV on               n:       252       ?      120      96
                       anti-social behavior          ES:        ?       ?        ?       ?

Subtotals                                            n:      1025     177      473     258
                                                     ES:        ?       ?        ?       ?

Smith, Glass &         Psychotherapy                 n:      1175      42      483      61
Miller ('80)                                         ES:      .57     .80      .65       ?

TOTALS                                               n:      2204     215      956     329
                                                     ES:      .64     .30      .48     .58

literature were integrated (i.e., journals, books and dissertations). The bias in the journal literature relative to the bias in the dissertation literature is not inconsiderable. The mean effect size for journals is .64 as compared with .48 for the dissertation literature; hence, the bias is of the order of [(.64 - .48)/.48] 100% = 33%; findings reported in journals are, on the average, one-third larger (about .16 standard deviation more favorably disposed toward the favored hypotheses of the investigators) than findings reported in theses or dissertations.
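
The arithmetic behind the comparison of the journal and dissertation means (.64 and .48) is brief enough to check directly. The function names below are illustrative only.

```python
def relative_bias_percent(journal_es: float, thesis_es: float) -> float:
    """Excess of the journal mean over the thesis mean, as a percentage
    of the thesis mean."""
    return 100.0 * (journal_es - thesis_es) / thesis_es


def absolute_difference(journal_es: float, thesis_es: float) -> float:
    """Difference between the two means in standard deviation units."""
    return journal_es - thesis_es


print(round(relative_bias_percent(0.64, 0.48)))   # 33
print(round(absolute_difference(0.64, 0.48), 2))  # 0.16
```

The two figures answer different questions: the first is a ratio of effects, the second a difference on the effect-size scale itself.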

Comparisons of average effect sizes among other sources of publication are less clear, in part perhaps, because of the ambiguity of labels such as "unpublished" or "book." In four of eight instances, the average effect size for journals was larger than for unpublished studies. Unpublished studies seemed to divide along the following lines: one large group of old unpublished studies, containing unremarkable results that never caught anyone's attention, and a smaller group of new studies circulating through the "invisible college" while waiting to be published.

In the meta-analysis of sex bias in counseling and psychotherapy (Smith, 1980a), not only the magnitude but the direction of effect was different in published and unpublished studies. A positive effect size indicated the biasing effect of counselor attitudes, judgments, and behaviors against female clients or against non-stereotyped roles for females. The effect size from published studies was .22, demonstrating counselor bias against females. The effect size from unpublished studies was -.24, demonstrating counselor bias in favor of females.

From these data it is appropriate to conclude that failing to represent unpublished studies in a meta-analysis may produce misleading generalizations.

To omit dissertations because of their assumed lack of rigor is also unwarranted. Only after the studies have been quantified and their results transformed to effect size measures can it be determined whether published studies on a topic were more rigorously designed than were unpublished studies and whether rigor of design related to magnitude of effect. In the psychotherapy meta-analysis (Smith, Glass, and Miller, 1980), there was no reliable difference in the rigor of design of published versus unpublished studies. In the sex-bias meta-analysis (Smith, 1980a), the published studies that showed bias against females actually had less rigorous designs than did studies (either published or unpublished) which showed no bias against females.

To make these decisions a priori may inject arbitrariness and bias into the conclusions. If meta-analysis offers any improvement over traditional methods of reviewing research, it is precisely in the area of removing these sources of arbitrariness to arrive at an impartial and representative view of "what the research says."

CHAPTER FOUR

DESCRIBING, CLASSIFYING AND
CODING RESEARCH STUDIES

Meta-analysis is the statistical analysis of research which works with research reports as its raw material. Thus, meta-analysis entails the quantitative description of the characteristics and findings of studies; this quantification usually involves measurement in its metric aspects (e.g., in what year was this study done? What is the sample size on which it is based?) as well as its nominal or coding function (were initial differences corrected by analysis of covariance? Yes = 1, No = 2). Since meta-analysis entails the measurement of study characteristics and findings, many concerns that apply to measurement more generally (e.g., reliability, validity) apply to measurement as applied in meta-analysis.
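
The distinction between metric and nominal coding can be made concrete with a small record type. This is a minimal sketch, assuming a coding sheet with just the three characteristics named above; the class, its field names and the example values are hypothetical and do not come from any coding form in the report.

```python
from dataclasses import dataclass
from typing import Optional

# Nominal code from the text: analysis of covariance used? Yes = 1, No = 2.
COVARIANCE_CODES = {1: "yes", 2: "no"}


@dataclass
class CodedStudy:
    study_id: int
    year: int                      # metric: year the study was done
    sample_size: int               # metric: n on which the finding is based
    covariance_adjusted: int       # nominal: 1 = yes, 2 = no
    effect: Optional[float] = None  # the finding, once put on a common scale

    def __post_init__(self):
        # Reject codes that are not part of the coding scheme.
        if self.covariance_adjusted not in COVARIANCE_CODES:
            raise ValueError("covariance_adjusted must be 1 (yes) or 2 (no)")
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")


study = CodedStudy(study_id=1, year=1966, sample_size=10, covariance_adjusted=2, effect=0.32)
print(study.year, study.sample_size, COVARIANCE_CODES[study.covariance_adjusted])
```

Validating codes at entry is one simple guard against the clerical side of unreliable measurement; it does nothing, of course, for the judgment a coder must exercise in reading the report.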

Consider the example in Table 4.1. There are recorded the characteristics and findings of about twenty correlational studies of the relationship between teachers' "indirect" teaching style (non-authoritarian, encouraging discussion instead of lecturing) and pupils' learning. For example, in Study #13 (Torrance and Parent, 1966), the indirectness of ten teachers' style was correlated with their pupils' mathematics achievement for a year-long course at the high-school level; the data were reported in the form of a Spearman rank-order correlation coefficient, which is itself the best estimate of the Pearson r. The reported correlation of teacher indirectness and pupil achievement was +.32, i.e., pupils of more indirect teachers learned more math.

On the face of the problem, there are six variables or characteristics descriptive of each study: the number of teachers studied (the sample size, in effect), the duration of the period of instruction, the subject tested, the grade-level of the pupils, the form of the originally reported findings (r's, F's, etc.), and an estimate of the correlation on the Pearson r scale. If one probes deeper, even more characteristics of studies are apparent or can be inferred from the research reports. For example, the year in which the study was reported appears in Table 4.1 and could be an interesting property of studies in a field subject to fads and trends. The identity of the researcher is known, and sometimes other characteristics can be inferred from such knowledge, e.g., Has this researcher done several studies or only one? Has he taken a public position on what this research ought to show? How many of the researchers are related as mentor-to-student or colleague-to-colleague? Moreover, variables that appear simple and straightforward reveal unexpected complications after a closer look. Take, for instance, "grade level" in Table 4.1. In a given study, X and Y may be correlated for n students and teachers who are spread among several grades (across fourth, fifth and sixth but averaging grade five, say). It may be necessary, then, to code both the average or modal grade of pupils represented and the range of grades as separate characteristics of the studies. Measurement of study findings is likewise complex. It is necessary to transform the findings of each study to a common scale of Pearson's r so that comparisons and contrasts can be made; but studies come reported in a bewildering variety of odd statistics. For example, in Study #11, Weber
Table 4.1

Results of Studies on the Relationship Between
Teacher Indirectness and Pupil Achievement
(After Gage, 1976)

                               No. of      Duration of    Learning Tested        Grade    Reported     Equivalent Value
    Study                      Teachers    Teaching       (Subject)              Level    Statistics   in Terms of r

 1. Flanders (1970)            15          2 semesters    language skills,       ?        ?            -.073
                                                          number skills
 2. Flanders (1970)            16          2 weeks        social studies         ?        r = .308      .308
 3. Flanders (1970)            30          2 semesters    Composite MAT          6        r = .221      .221
 4. Flanders (1970)            15          2 weeks        Social Studies         7        r = .181      .181
 5. Flanders (1970)            16          2 weeks        Mathematics            8        r = .428      .428
 6. Cook (1967)                8           ?              Discussion; Lab Work   10       ?             .09; .07
 7. Furst (1967)               15          4 one-hour     Economics              10, 12   ?             .11
                                           lessons
 8. Medley-Mitzel (1959)       19          2 semesters    Reading                3-6      r = .20       .20
 9. Powell (1968)              9           2 semesters    Composite SRA          3        ?             ?
10. Snider (1966)              ?           2 semesters    ?                      ?        ?             ?
11. Weber (1968)               17 (no. in  3 years        ?                      ?        ?             ?
                               analysis)
12. Thompson & Bowers (1968)   15          2 semesters    ?                      ?        ?             ?
13. Torrance-Parent (1966)     10          2 semesters    Mathematics            7-12     rho = .32     .32

Learning tested in Studies 10-12 (row alignment not recoverable): Verbal Fluency; Word Meaning; Social Studies. Reported statistics for these studies include F-ratios (e.g., df = 1,176).

divided the 26 teachers into two groups (above and below average on "indirectness") and then performed an analysis of variance F-test on their pupils' creative thinking test scores. Transforming the resulting F-ratio into an equivalent measure of r took some statistical magic; hence, the form of the translation and its assumptions are characteristics of the studies that could be coded.
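
One textbook ingredient of such a translation is the identity linking a two-group F-ratio (one degree of freedom in the numerator) to a point-biserial correlation, r = sqrt(F / (F + df_error)). The sketch below shows only that step. It is an assumption that the conversion used for Study #11 began this way; the report does not state the procedure, and because the teachers were split at the average of a continuous variable a further correction for dichotomization may well have been applied. The numbers in the example are hypothetical.

```python
import math


def f_to_point_biserial_r(f_ratio: float, df_error: int, positive: bool = True) -> float:
    """Point-biserial r implied by a two-group F-ratio with df_error
    error degrees of freedom. The sign must be supplied from the
    direction of the group difference, since F itself is unsigned."""
    if f_ratio < 0 or df_error <= 0:
        raise ValueError("F must be non-negative and df_error positive")
    r = math.sqrt(f_ratio / (f_ratio + df_error))
    return r if positive else -r


# Hypothetical values: F = 4.0 with 96 error degrees of freedom.
print(round(f_to_point_biserial_r(4.0, 96), 2))  # 0.2
```

Whatever formula is used, recording it as a coded characteristic of the study, as the text suggests, lets one later ask whether converted findings differ systematically from directly reported ones.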

The point of this measurement and coding of study characteristics is to relate the properties of the studies (their subjects, investigators, technical qualities and the like) to the study findings. For example, by comparing the r's for studies done at the elementary (K-6) and secondary (7-12) levels in Table 4.1, we were able to discover that the correlation between teacher indirectness and pupils' learning is higher (mean r = .30 based on eight cases) at the secondary level than at the elementary level (mean r = .16 based on ten cases), perhaps because young pupils need more direction or perhaps because lecturing style is less relevant in earlier grades (Glass et al., 1968).
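
Relating a coded characteristic to the findings amounts, in the simplest case, to averaging the findings within each level of the characteristic. The sketch below assumes a plain unweighted mean of r's, which is one reading of the comparison just described; the four correlations are hypothetical and were chosen only so that the group means come out at .16 and .30.

```python
from statistics import mean


def mean_r_by_level(coded_findings):
    """Unweighted mean correlation for each level of a coded characteristic.
    `coded_findings` is an iterable of (level, r) pairs."""
    groups = {}
    for level, r in coded_findings:
        groups.setdefault(level, []).append(r)
    return {level: round(mean(values), 2) for level, values in groups.items()}


# Hypothetical coded findings: (school level, correlation).
example_findings = [
    ("elementary", 0.20),
    ("elementary", 0.12),
    ("secondary", 0.32),
    ("secondary", 0.28),
]

print(mean_r_by_level(example_findings))  # {'elementary': 0.16, 'secondary': 0.3}
```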

The example of a meta-analysis of teacher indirectness was rather long; we hope that it helped make the point that the measurement of study characteristics and findings requires ingenuity and care in the definition of properties of studies and their quantification.

GENERAL CONSIDERATIONS

Measurement of study characteristics and findings can be evaluated with respect to both its validity and reliability, as are other instances of measurement.

Validity. The validity of measuring study properties and findings is a very broad consideration. Most things that bear on the meaning of a coded or measured characteristic are matters of validity. These considerations include such things as clarity of definitions, adequacy of reported information, the degree of inference a coder must make in determining from the written report what characterized the research, and the like. Some problems of validity can be corrected by greater care in reading and coding studies: making definitions sharper and more detailed, splitting broad concepts into more refined ones. Other problems of validity cannot easily be corrected: one must infer that in a particular study the assignment of subjects to experimental conditions was non-random because random assignment was not specified and there are significant differences on most pretest variables. There probably aren't any useful general technical guidelines for making study measurement more valid. Examples may have to substitute for principles.

Consider a somewhat extreme example of measurement of study characteristics that was pursued with more than normal care for the sake of the validity of the measurement. Smith and Glass (1977) performed a meta-analysis of nearly four hundred controlled experiments on psychotherapy outcomes. One characteristic of studies that was of principal interest was the type of psychotherapy being evaluated (e.g., Rogerian, Adlerian, behavioral, etc.). Even the simple labeling of the psychotherapy in a single study grew unexpectedly difficult at times. Could a psychotherapy described as "non-directive reflection of feeling plus empathic understanding" be properly coded as Rogerian in the absence of the investigator's having labeled it Rogerian or otherwise referred to Carl Rogers? Yes, it probably was safe to do so. But what of tougher cases? Suppose an investigator reported a study in which he compared "psychotherapy" against a wait-list control group; rather than naming the specific type of psychotherapy he merely referred to the therapists' attempts "to interpret clients' defense mechanisms and help them gain insight into the causes of their difficulties." Is it safe to assume that the therapy was psychoanalytic psychotherapy and code it as such in the meta-analysis? Or would it be more prudent to classify the therapy as "eclectic insight therapy"? There's no general answer since questions at this level would be resolved by particular considerations of purposes we haven't specified. The examples merely illustrate the complexities of defining and recognizing qualities (requisites of measurement) of studies from written reports.

In our work on psychotherapy outcomes, complexities of measurement (or classification) were encountered again at a more general level. More than twenty specific types of psychotherapy appeared in the nearly 400 experiments. These twenty were fairly easily grouped into ten more general types of psychotherapy: Rogerian, Gestalt, Rational-emotive, Transactional Analysis, Adlerian, Freudian, Psychoanalytic Psychotherapy, Behavioral Modification, Systematic Desensitization, and Implosion. It was deemed worthwhile to attempt to group these ten psychotherapies into a small number of more general classes so as to address additional questions in the meta-analysis. But questions remained about how this grouping might best be done. On the basis of what evidence or what process of judgment would therapies A, B and C be deemed to belong to Therapy Class I and therapies D and E to Therapy Class II? In a general sense, the question was one of measurement validity, even if measurement in this instance was only classification and coding. Perhaps the least valid grouping of therapies into homogeneous classes would have been based on our own unexplained judgment of which therapies were similar to which others. Instead, we enlisted the help of about twenty-five clinicians and counselors. For about ten hours we studied and discussed the theory and techniques of each of the ten psychotherapies. Then the therapists gave their rankings of the similarities among the psychotherapies using the method of multi-dimensional rank-ordering (Torgerson, 1958). The therapists' similarity judgments were then subjected to analysis by multi-dimensional scaling (Shepard, 1962). A graphic representation resulted of the therapists' perceptions of the similarities among the ten psychotherapies (see Figure 4.1). In the three-dimensional space in Figure 4.1, the distance between two therapies (represented by black circles) is inversely related to the similarity between the therapies in the perceptual space of the judges (therapists). The four amoeba-like figures in Figure 4.1 connect therapies that are near each other in the space. Thus, Rogerian and Gestalt therapies form a class of psychotherapies, as do Rational-emotive and Transactional Analysis. In this manner four classes of psychotherapies were derived, and they were derived so as to reduce the influence of arbitrariness and idiosyncrasy; thus, one hopes they represent a more valid classification (measurement) of studies than might otherwise have been done.

Figure 4.1. Multidimensional scaling of ten psychotherapies by 25 clinicians and counselors.
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ReJLiabilirv , Reliability in rhe generic sense of c'he word 
refers to consistency of measurement. What is the extent of agreement
among different measurements of the same thing? There exist many
alternative ways in which the measurements to be compared for agreement
may be different. For example, in the familiar instances of the reliability
of measurement of human behavior, the most prominent source of different
measurements is temporal variation in the behavior itself. A
psychologist may wish to measure people's mood on a scale of "happy . . .
sad." He may use the same fifty-question standard inventory with each
measurement so that different scores could not arise from some
instability in the more mechanical aspect of the testing; but he may
discover nonetheless that he obtains relatively inconsistent scores
for persons because their moods are fleeting: happy in the morning,
apathetic by lunch, melancholy by evening. If the psychologist chose
instead to measure mood by clinical interview, the potential sources
of unreliability might multiply: instability in people's moods across
time, differences among questions posed by interviewers from one
occasion to the next, differences in the standards of judgment employed
by the interviewers, and the like. Cronbach and his colleagues have
brought psychometrics around to the notion that the question of measurement
reliability is basically the question of the relative contribution to
inconsistency of measurement of multiple sources of differences among the
conditions of measurement (Cronbach, Gleser, Nanda and Rajaratnam, 1971).
This point of view helps one think more clearly about problems of
measurement reliability that arise in research meta-analysis.

The measurement problem in meta-analysis is the problem of
measuring (quantifying, classifying, coding) the characteristics and
findings of studies based on written reports. That the thing measured
is a written report that cannot change from one day to the next (except
for spirit "ditto" copies that eventually fade into illegibility?)
eliminates a major source of inconsistent measurement that plagues
measurement of individual or group actions. The principal source of
measurement unreliability in meta-analyses arises from different readers
(coders) not seeing or judging characteristics of a study in the same
way. Judge-consistency or rater-agreement is the most important
consideration for our purposes.

There is no total remedy for the inconsistency that arises among
different coders of the same research study. Explicit instructions,
specificity in defining characteristics and Gründlichkeit (thoroughness)
will all help reduce the problem somewhat, but there are limits to what
can be specified before the fact and how much detail can be imposed on
coders before they quit. The guidelines we propose are 1) good sense and
reasonable care at the outset, 2) assessment of the extent of disagreement
by having multiple judges read a set of common studies, and 3) correction
of flagrant inconsistencies discovered at step 2. Step 2 is the
important one; all but the simplest meta-analyses should be subjected
to an assessment of the reliability (in the rater-agreement sense of the
word) of the coding procedures.

An example may help clarify this recommendation. In assessing
the comparative effects of drug vs. psychotherapy, Smith, Glass and
Miller (1980) developed an extensive coding system for describing the
characteristics and findings of 151 experiments collected from the
literature of psychopharmacology. To test the reliability of the
coding, 2 judges were enlisted to code 5 studies. One judge coded 2
studies, and one coded 3. The judges were unfamiliar with the
psychopharmacological literature, but well-practiced in general coding and
effect-size calculation common in meta-analysis.

The 5 studies were included in the 151 studies gathered for
the meta-analysis. Each judge received a drug-only study and a study
of drug-plus-psychotherapy. The studies were chosen at random from all
studies under ten pages in length. This restriction of length was
adopted to reduce the time necessary for the judges to devote to the task.
A brief list of coding conventions was given to each judge, with a
request to code only the effect size for one or two dependent variables
if there were many from which to choose.

One hundred sixty-two ratings were recorded by the 2 judges over
the 5 studies (not including the effect sizes themselves) and were
matched with an equal number of ratings by a third judge. One hundred
twenty-two (75 percent) were identical and another 13 (8 percent) were
within one or two scale points for five-point rating scales or continuous
variables such as patient age, duration of treatment, and the like.
Seventeen percent of the ratings were placed into the wrong category
or were off by more than two scale points. These incorrect codings
included such inconsistencies as the rating of an outcome measure as
hospital adjustment rather than work adjustment or as somatic symptoms
instead of anxiety. The codings of the two judges did not differ
substantially from the codings of the third.
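
The agreement percentages reported for the 162 ratings can be recomputed from the counts. The following is a minimal sketch in Python; the function name and the three-way tally are illustrative assumptions, and the count of discrepant ratings is inferred as the remainder (162 - 122 - 13) rather than taken from the text.

```python
def agreement_summary(total, identical, near):
    """Return rounded percentages of identical, near, and discrepant ratings.

    'discrepant' is inferred as whatever is left after the identical and
    near-agreement ratings are removed; the report gives it only as a percent.
    """
    discrepant = total - identical - near
    return {
        "identical": round(100 * identical / total),
        "near": round(100 * near / total),
        "discrepant": round(100 * discrepant / total),
    }


# Counts as reported for the drug vs. psychotherapy reliability check.
summary = agreement_summary(total=162, identical=122, near=13)
print(summary)
```

A tally of this kind treats every coded item alike; a fuller assessment would report agreement separately for each coded characteristic.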

Agreement between each judge's calculation of effect sizes and
an earlier independent calculation was substantial. A sixth study was
added exclusively to give another test to the replicability of
effect-size calculation. This study was chosen to represent a relatively
complex case for calculation. Calculated by the second judge, it is
reported last in Table 4.2 below.

Table 4.2. Effect sizes for two judges compared to those of a third judge

The ES's, effect sizes, referred to in Table 4.2 are mean
differences divided by standard deviations, a measure of experimental
outcome already encountered several times in this text. It may strike
the reader as curious that in only one of six instances in Table 4.2 did
the two judges make calculations of effect size that agreed through two
decimal places. Be assured that the discrepancies (none terribly large
and on the average quite small, viz., .07) do not seem surprising at all
to us. As will be seen in Chapter V, although the definition of ES is
very simple, its calculation in particular instances can be extremely
complex, frequently calling on complicated judgments about how to
aggregate sources of variation, about when to make simplifying
assumptions and when not to, and often entailing arduous chains of
calculations in which accuracy may be compromised by rounding off a
six-digit answer to four digits at some intermediate stage.
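
The definition of the effect size, and the kind of judge-to-judge discrepancy discussed here, can be sketched in a few lines of Python. The function names and all numbers below are hypothetical and are not the entries of the table; which standard deviation to divide by (control-group, pooled, or other) is left to the caller, since this passage does not specify it.

```python
def effect_size(mean_treated, mean_control, sd):
    """Mean difference divided by a standard deviation."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (mean_treated - mean_control) / sd


def mean_abs_discrepancy(es_a, es_b):
    """Average absolute difference between two coders' effect sizes."""
    if len(es_a) != len(es_b) or not es_a:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(es_a, es_b)) / len(es_a)


# Hypothetical values, for illustration only.
print(effect_size(mean_treated=24.0, mean_control=20.0, sd=8.0))
judge = [0.50, 0.25, 1.00]
third_judge = [0.50, 0.75, 1.25]
print(mean_abs_discrepancy(judge, third_judge))
```

The simplicity of the formula is the point: the disagreements between judges arise in deciding what to feed into it, not in the division itself.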


CHARACTERISTICS OF STUDIES 

The characteristics of studies that are most important in a
meta-analysis (apart from the findings, of course) can be roughly
classified as either substantive or methodological. Substantive features
are those characteristics of studies that are specific to the problem
studied, e.g., in a meta-analysis of drug treatment of hyperactivity the
substantive characteristics might include 1) the type of drug administered
(caffeine, amphetamines, etc.), 2) the size of the dose, 3) the age of
the subjects, 4) the presence or absence of checks for ingestion, and
so on. The methodological characteristics of studies are more general;
they may be nearly the same for all meta-analyses of a general type,
such as experimental studies, correlational studies or surveys. They
include a virtual table of contents of research methods books: 1)
sample size, 2) test reliability, 3) randomization v. matching v.
non-equivalent groups, 4) degree of subject loss, 5) single-blind,
double-blind or unblinded, and the like.

The purpose underlying coding the substantive and methodological
characteristics of studies is the same: one wants to learn whether the
findings of the studies differ depending on certain of their
characteristics. A meta-analysis seeks a full, meaningful statistical
description of the findings of a collection of studies, and this goal
typically entails not only a description of the findings in general but
also a description of how the findings vary from one type of study to the
next. An example might clarify the use of both substantive and
methodological study characteristics in this respect.

In a meta-analysis of the relationship between school class-size
and pupil achievement, we coded nearly thirty substantive and methodological
features of each study, including the findings, viz., the standardized
average difference in achievement between the larger, L, and the smaller,
S, class (Glass and Smith, 1979). The characteristics coded for each
study included where the study was published, in what country the
research was performed, the date of publication, which subjects were
taught to the pupils, and many others which can be seen in the facsimile
of the coding sheet that is reproduced as Table 4.3. Using statistical
models that will be presented in Chapter V, the data from over 700
comparisons of pupil achievement in smaller and larger classes were
integrated into an aggregate curve descriptive of the relationship
as revealed by the empirical research literature. But the analyses did
not stop there. Many persons feel that the nature of the relationship
between class-size and learning may vary depending on what subject is
taught (math learning may flourish in small classes, but not physical
education, for example) or the age of the learners. Moreover, it is
possible that a flaw in research methods (unreliable tests or improper
statistical analysis, for example) obscures the class-size and achievement
relationship in some studies. To check for these possibilities, we
analyzed the class-size and achievement relationship separately for
various subdivisions of the data. For example, all studies involving
pupils in grades kindergarten through six were separated from those done

Table 4.3.

CLASS SIZE CODING SHEET

IDENTIFICATION:

 1) Study ID#: ____    2) Authors: ____    3) Year: ____
 4) Source of data: __Journal __Book __Thesis __Unpublished report
 5) Classification of study: __Class size __Ability grouping __Tutoring
                             __Psychol. experiment __Secondary analysis
 6) Country of origin: ____

INSTRUCTION:

 1) Subject taught: __Reading __Math __Language __Other: ____
 2) Duration of instruction: ____ hrs. ____ weeks
 3) Supplemental vs. integral:
      __Instruction supplemented other large group instruction.
      __Instruction constituted entire teaching of the subject.
 4) Adaptation of instruction to class size:
      Type of instruction in smaller class: ____
      Type of instruction in larger class: ____

                                         Smaller Class    Larger Class
 5) No. of pupils:                       ____             ____
 6) No. of instructional groups:         ____             ____
 7) No. of instructors:                  ____             ____
 8) Pupil/instructor ratio:              ____             ____
 9) Accuracy of estimate of ratio:       Lo Av Hi         Lo Av Hi
10) Instructor type: __Teachers __Adult aides or tutors __Both
11) Sex of teacher: __M __F
12) Years teaching experience: ____ years

CLASSROOM DEMOGRAPHICS:

 1) Pupil ability: __IQ < 90   __90 ≤ IQ < 110   __IQ > 110
 2) Percent pupils female: ____ %
 3) Ages: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
 4) Average age: ____ years

STUDY CONDITIONS:

 1) Study setting: __Regular classroom __Experimental setting
 2) Assignment of Ss to groups: __Random __Matched __"Repeated measures"
                                __Uncontrolled
 3) Assignment of instructors to groups: __Random __Matched
                                __"Repeated measures" __Uncontrolled
 4) Percent attrition: Small class: ____ %   Large class: ____ %

OUTCOME VARIABLE:

 1) Type of Outcome Variable:
      __Standardized achievement test: ____
      __Ad hoc achievement test: ____
      __Pupil attitude: ____
      __Teaching behavior: ____
      __Pupil-teacher interaction: ____
      __Teacher attitude or satisfaction: ____
 2) Quantification of Outcome:
      __Gain scores (simple)
      __Residualized gain scores
      __Uncorrected dependent variable
 3) Congruence of instruction and outcome measure: Low Average High
 4) Follow-up time: ____ weeks from the end of instruction to the
    measurement of outcomes
 5) Standardized mean difference (Small-Large): ____

with pupils in secondary school (a substantive characteristic).
The statistical curve describing how achievement is related to
class-size was then derived for each of these two parts of the data. It so
happened that the two curves were nearly the same (within statistical
error) so that there was no need to modify the conclusion of a
class-size and achievement relationship for different age-groups of pupils.

However, one methodological characteristic of the studies was
strongly related to our conclusions. Over 100 comparisons of achievement
in smaller and larger classes came from studies in which the
threat of pre-existing differences between classes was controlled
by random assignment to the two classes; the remaining comparisons
came from studies in which poor controls were achieved (e.g., naturally
occurring smaller and larger classes were compared). The studies were
thus distinguished with respect to a characteristic of research method.
When the statistical curves were derived for these two parts of the
data, quite a different picture emerged from what was seen when
elementary-grade and secondary-grade studies were compared. The graphs
of the two curves appear in Figure 4.2. Not only what we said about the
class-size and achievement relationship but what we concluded about the
trustworthiness of research on the question, were affected by our discovery
that the study findings varied as a function of methodological
characteristics of the studies themselves.
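
The logic of splitting the findings by a coded characteristic can be sketched as follows. This is a simplified illustration only: the analysis described in the text fitted a curve to each part of the data, whereas the sketch substitutes a plain mean of effect sizes, and the records, field names and function name are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_effect_by(records, key):
    """Average the effect sizes within each level of one coded characteristic."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["es"])
    return {level: mean(values) for level, values in groups.items()}


# Hypothetical coded comparisons; not data from the class-size literature.
records = [
    {"assignment": "random", "grade": "elementary", "es": 0.25},
    {"assignment": "random", "grade": "secondary", "es": 0.75},
    {"assignment": "uncontrolled", "grade": "elementary", "es": 0.0},
    {"assignment": "uncontrolled", "grade": "secondary", "es": 0.5},
]

# A substantive split (grade level) and a methodological split (assignment).
print(mean_effect_by(records, "grade"))
print(mean_effect_by(records, "assignment"))
```

Running the same summary once per coded characteristic is what makes the coding sheet pay off: any feature recorded for every study can serve as the grouping key.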
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An Example of Study Coding 

In our meta-analysis of psychotherapy outcome experiments
(Smith, Glass and Miller, 1980), we developed a long list of substantive
and methodological characteristics for describing the research literature.
The numeric coding of each study extended across nearly three computer
cards, 211 digits of coding in all. A facsimile of the coding sheet
appears as Appendix A. It contains the following variables on which
each study was classified: date of publication; form of publication;
professional affiliation of the experimenter; the degree of blinding
used in the study; whether more than one treatment was simultaneously
compared against the control group; client diagnosis; previous
hospitalization; intelligence; age; sex; similarity of client to the
therapist; the means by which the clients were obtained for the study;
means of assigning clients and therapists to comparison groups; mortality
(loss of subjects) from samples; internal validity of the study; the type,
duration, modality, and location of the treatment; sample size;
therapist experience; type and reactivity of outcome measure and the
time after therapy when it was measured; whether factorial effects
were tested; and the statistical procedures for determining the size
of effect produced by the therapy. Each variable is further described
below.

Each study was read and a coding form was completed for each
outcome and each comparison in the study. This task presented a
range of difficulty depending on the clarity of the research report
and the conformity of the experimenter to standard research practices.
A list of coding conventions was developed during the pilot phase of
the project and was used to guide the classification of studies
whose characteristics were ambiguous. These conventions are explained
in the following paragraphs.

Date of Publication. This was recorded as stated on the
manuscript. Some studies were published more than once, and
in this case the earliest date was recorded.

Form of Publication. The study was classified according
to the form in which it appeared: journal article, book,
dissertation, or unpublished manuscript. If more than one
form was used, such as a dissertation later published in a
journal, the study was designated in its most accessible form.

Professional Affiliation of Experimenter. The study was
classified, according to the affiliation of the experimenter,
as either psychology, education, psychiatry, social work, or
"other." This classification was determined by the institutional
and departmental identification on the manuscript, or by
membership in the American Psychological Association.

Blinding of Experimenter. This variable represents the
degree of blinding that prevails in the assessment of outcomes or
in the administration of these in the study. If the experimenter
or the outcome evaluator was kept uninformed about whether each
subject was in the control group or the treated group, the study
was classified as "single blind." If no information was provided
that showed that the experimenter or evaluator was kept uninformed
about group composition, the study was categorized as either
"experimenter did the therapy" or "experimenter knew the composition
of the groups but didn't personally treat the client."

Client Diagnosis. In the meta-analysis, the diagnostic
label that the experimenter used was recorded and classified
into a twelve-category diagnostic system. The categories were
(1) neurotic or true (complex) phobic, (2) simple (monosymptomatic)
phobic, (3) psychotic, (4) normal, (5) character disordered,
(6) delinquent or felon, (7) habituee (e.g., alcohol, tobacco,
drug addiction), (8) emotional-somatic disordered, (9) handicapped
(physically or mentally), (10) depressive, (11) mixed diagnoses,
and (12) unknown.

Hospitalization. The number of years of previous
hospitalization, as stated or implied by the author, was another
indication of the severity of client distress and was recorded.

Intelligence. Intelligence of the client is frequently
cited as mediating the effects of psychotherapy. The intelligence
of the group was rated as "below average" for IQ scores less
than 95, "average" for IQ scores between 95 and 105, and "above
average" for IQ scores above 105. The source of information
about client intelligence was also recorded. In 4 percent of
the studies, IQ was reported by the experimenter. In 61 percent
of the studies, IQ could be inferred (at least with the accuracy
necessary to make the three gross distinctions) from the client's
placement in some institution, such as a college or a treatment
facility for the mentally retarded. In 35 percent of the cases,
client intelligence could not be assessed from the report and
therefore was estimated as average.
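
The three-level intelligence rating amounts to a simple threshold rule. The sketch below is one way to write it down; the function name is an assumption, and treating scores of exactly 95 and 105 as "average" follows the wording "between 95 and 105" read inclusively.

```python
def classify_group_iq(iq=None):
    """Map a client group's IQ score to the three coded levels.

    A missing score is coded "average", following the convention that
    intelligence that could not be assessed was estimated as average.
    """
    if iq is None:
        return "average"
    if iq < 95:
        return "below average"
    if iq > 105:
        return "above average"
    return "average"


for score in (88, 95, 105, 112, None):
    print(score, classify_group_iq(score))
```

Because 35 percent of cases fell back on the default, any analysis using this variable would do well to carry the source-of-information code alongside the rating.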

Client-Therapist Similarity. The socioeconomic and ethnic
similarity between client and therapist is also thought to
influence the outcome of therapy. The cultures of the therapist
and the client are similar in the sense that they share common
languages, value systems, and educational backgrounds. The
more healthy the client, the more he resembles the therapist.
The studies were rated for similarity between the client and the
typical white, middle-class, well-educated therapist. The
highest value (4) was used for studies of white, middle-class,
well-educated, and mildly or moderately distressed clients. The
lowest value (1) was used when the typical therapist treated
lower-class minority or severely disturbed clients.


Solicitation of Clients. The use of volunteers in therapy
studies has been sufficient cause for some previous reviewers
to disallow these studies as tests of therapeutic effect. Yet
in the case of most analogue studies, the volunteers reported
symptoms, requested and were given psychological treatments to
remedy them. It is possible that they differ only in degree from
"real" clients who independently seek treatment. The studies
were classified according to whether (1) the subjects were
solicited for therapy by the experimenter (usually by offering
treatment to psychology students who obtained extreme scores on
anxiety measures); (2) the subjects came to the treatment program
in response to an advertisement; (3) the subjects recognized the
existence of a problem and sought treatment; (4) the subjects
were referred for treatment; or (5) the subjects were committed
to the treatment, with no choice.

Assignment to Groups. A characteristic often afforded most
importance in judging the validity of a comparative study is how
the experimenter allocated subjects to treated and control
groups. Random assignment insures, within probability limits,
that the two groups are initially comparable and that differences
between them on the post-test are attributable either to
chance (with probability equal to the significance level) or
to the treatment and to no other source of influence. Matching
pairs of subjects is the next best method, although using it
presumes that all sources of influence on therapy are known
and can be used as matching variables. Moreover,
it renders significance levels meaningless when calculated in
the usual ways. Ex post facto matching, covariance adjustments,
and equating on pretest scores are less satisfactory allocation
methods, but still better than no matching at all. Studies were
classified according to the assignment of both clients and
therapists to groups.

Experimental Mortality. Dropouts from treatment and control
groups represent a critical problem in psychotherapy research.
Eysenck and Rachman declared that a dropout must be considered
a treatment failure. Yet early termination can be explained by
a variety of reasons other than treatment failure. These include
economic problems, family or work problems unconnected with the
psychological difficulties, amelioration of symptoms, scheduling
changes, physical illness unrelated to treatment, and even death.
Unless these alternative explanations are accounted for, the
premature terminators cannot be classified as either successes
or failures. Yet the decision to include or exclude terminators
from final statistics may have a substantial effect on the finding
of a study. Because the decision is made on professional judgment
rather than independent empirical justification, the decision
invites bias.

Premature termination is best regarded as a problem of the
internal validity of the study and not confounded with outcome
measurement. In this study, the percent mortality was coded
separately for treated and untreated groups. These figures were
occasionally difficult to ascertain and involved comparing
degrees of freedom in post-test analyses with the numbers of
subjects originally allocated to groups. A study might also
have different rates of mortality at the times of the post-test
and the follow-up. These different mortality percents
were noted separately.
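
Recovering mortality from reported degrees of freedom can be sketched as below. The sketch assumes the simplest case, an independent two-group t test, for which the degrees of freedom equal the total number of subjects minus 2; other analyses have different relations, and degrees of freedom alone yield only the overall loss, not the loss in each group. The function names and the example figures are hypothetical.

```python
def percent_mortality(n_allocated, n_completed):
    """Percent of originally allocated subjects lost before measurement."""
    if n_allocated <= 0:
        raise ValueError("n_allocated must be positive")
    if not 0 <= n_completed <= n_allocated:
        raise ValueError("n_completed must lie between 0 and n_allocated")
    return 100.0 * (n_allocated - n_completed) / n_allocated


def total_n_from_t_df(df):
    """Total N implied by an independent two-group t test (df = N - 2)."""
    return df + 2


# Hypothetical report: 50 subjects allocated, post-test t reported with 43 df.
n_at_posttest = total_n_from_t_df(43)
print(n_at_posttest)
print(percent_mortality(50, n_at_posttest))

# When group sizes at post-test are reported, code each group separately.
print(percent_mortality(25, 24), percent_mortality(25, 21))
```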


Internal Validity. The internal validity of a study was
judged on the basis of the assignment of subjects to treatment
and the extent of experimental mortality in the study. To be
judged high on the internal validity scale, a study must have
used random assignment of subjects to groups and have a rate of
mortality less than 15 percent and equivalent between the two
groups. If mortality was higher or nonequivalent, internal
validity was still rated high if the experimenter included the
scores of the terminators in the post-test statistics or established
the initial equivalence of terminators and nonterminators.
Medium internal validity ratings were given to (1) studies with
randomization but high or differential mortality; (2) studies with
"failed" randomization procedures (e.g., where the experimenter
began by randomizing, but then resorted to other allocation
methods, such as taking the last ten clients and putting them
into the control group) with low mortality; and (3) extremely
well-designed matching studies. Low validity studies were those
whose matching procedures were quite weak or nonexistent (e.g.,
where intact convenience samples were used) or where mortality
was severely disproportionate. Occasionally, statistical or
measurement irregularities decreased the value assigned to
internal validity, such as when an otherwise well-designed study
employed different testing times for treated and untreated
groups. This measure of internal validity was not contaminated
by sample size, reactivity of measures, or the degree of blinding
employed in the study. All four constructs were assessed separately.
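
The internal validity rating combines assignment method and mortality, and the combination can be written as a small decision rule. The sketch below is an interpretation, not the coding manual: the assignment labels are hypothetical, and the numeric thresholds for "equivalent" and "severely disproportionate" mortality are assumptions introduced for illustration, since the text gives none. Downgrades for statistical or measurement irregularities are not modeled.

```python
def rate_internal_validity(assignment, mortality_treated, mortality_control,
                           terminators_handled=False,
                           equivalence_tolerance=5.0, severe_gap=25.0):
    """Return 'high', 'medium', or 'low' for one study.

    assignment: 'random', 'failed_random', 'strong_matching', 'weak_or_none'.
    mortality_*: percent mortality in each group.
    terminators_handled: terminators' scores were included in the post-test
        statistics, or their initial equivalence was established.
    equivalence_tolerance, severe_gap: assumed cut-offs (percentage points).
    """
    gap = abs(mortality_treated - mortality_control)
    low_mortality = (max(mortality_treated, mortality_control) < 15.0
                     and gap <= equivalence_tolerance)
    if assignment == "random" and (low_mortality or terminators_handled):
        return "high"
    if assignment == "weak_or_none" or gap >= severe_gap:
        return "low"
    if assignment == "random":
        return "medium"  # randomized, but high or differential mortality
    if assignment == "failed_random":
        # Only the low-mortality case is described; 'low' otherwise is assumed.
        return "medium" if low_mortality else "low"
    if assignment == "strong_matching":
        return "medium"
    raise ValueError(f"unknown assignment code: {assignment!r}")


print(rate_internal_validity("random", 5.0, 7.0))
print(rate_internal_validity("random", 30.0, 10.0))
print(rate_internal_validity("random", 30.0, 10.0, terminators_handled=True))
```

Writing the rule out makes its gaps visible: the text leaves several combinations to the coder's judgment, which is exactly where rater disagreement is likely to arise.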

Allegiance of the Experimenter. Faith in the therapy on
the part of the therapist has been mentioned as a putative cause
of positive therapeutic effects. From the tone and substance of
the research report, it was usually possible to determine whether
the experimenter was partial to the treatment evaluated. For
example, when the report contained enthusiastic endorsements
of the therapy, this variable was coded as positive. Where a
second therapy was clearly a foil for the favored therapy, this
variable was coded as negative. Placebo treatments were always
coded as negative. Where the experimenter was the therapist,
this variable was coded positive.

Therapy Modality. Each study was coded for the modality in
which the therapy was delivered: individual, group, family,
mixed modalities, automated, or "other."

Treatment Location. Each study was coded according to the
location in which the therapy was delivered: school, hospital,
mental health center, other clinic, private practice, college
facility, prison, residential facility, or "other."

Therapy Duration. The duration of therapy, both in number
of hours and weeks, was recorded. The rate (hours per week) of
therapy was computed from these two variables.



Therapist Experience. The number of therapists used in the
study and their experience in years was recorded. Because
reports were frequently lacking this information, the following
conventions were developed for translating relevant bits of
information into years of therapist experience when no more
specific information was given:

    Undergraduates or other untrained assistants    = 0 years
    MA candidates                                   = 1 year
    MA-level counselor or therapist                 = 2 years
    Ph.D. candidate or psychiatric resident         = 3 years
    Ph.D.-level therapist                           = 5 years
    Well-known, Ph.D.-level therapist               = 7+ years
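
The experience conventions form a lookup table that applies only when the report gives no specific figure. A minimal sketch follows; the key names and function name are hypothetical, the MA-candidate value is read as 1 year, and the table's "7+" is coded at its lower bound, which is an assumption.

```python
EXPERIENCE_YEARS = {
    "untrained": 0,                  # undergraduates, untrained assistants
    "ma_candidate": 1,               # value read as 1 year
    "ma_level": 2,                   # MA-level counselor or therapist
    "phd_candidate_or_resident": 3,  # Ph.D. candidate, psychiatric resident
    "phd_level": 5,
    "well_known_phd": 7,             # "7+" coded at its lower bound (assumed)
}


def therapist_experience(reported_years=None, training_level=None):
    """Prefer years stated in the report; otherwise apply the convention."""
    if reported_years is not None:
        return reported_years
    if training_level is None:
        return None  # nothing in the report to code from
    return EXPERIENCE_YEARS[training_level]


print(therapist_experience(reported_years=4, training_level="phd_level"))
print(therapist_experience(training_level="phd_level"))
```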


Outcome Measurements. Previous reviewers have struggled
with the philosophical and technical problems connected with the
selection and measurement of outcomes. A reviewer might count a
study as supportive or not supportive of the effectiveness of
psychotherapy based on the statistical significance of the outcome
measure. Yet most studies employed more than one outcome measure,
using different instruments or the same instrument given at
different times after therapy. When different measures produced
different results, several strategies were employed to cope with
this problem. A study could be counted twice, for example, with
one vote for and one against the therapy. Or, if a study showed
positive effects at therapy termination, but no effects at the
follow-up, that study could be listed as a negative indicator of
therapeutic effectiveness. This strategy exemplifies a confusion
between the use of empirical research for theory building and
research done for evaluative purposes, i.e., to determine the
effects and practical value of a treatment. The direction of
desirable therapeutic effect was obvious in nine out of ten
cases by examining the research hypotheses stated by the
experimenter or the narrative description of results. In the
remainder of cases, the Mental Measurements Yearbooks were
consulted, or other studies that had employed the same measure.
Each outcome measurement listed by the experimenter was used in
the meta-analysis. Each measure was weighted equally; however,
redundant measures were eliminated. If, for example, a second
measure matched the first in outcome type, degree of reactivity,
follow-up time, and approximate size of effect, the second measure
was deemed redundant and ordinarily not included in the
meta-analysis. When subtest scores of multifactorial test batteries
(e.g., MMPI) were reported, and the subtests yielded results that
were only randomly different from one another, an average of the
subtests was used. Total test battery results were used in
favor of separate subtest scores.

The specific outcome was recorded and grouped into one of
twelve outcome types: (1) fear or anxiety measures; (2) measures
of self-esteem; (3) tests and ratings of global adjustment;
(4) life indicators of adjustment; (5) personality traits;
(6) measures of emotional-somatic disorders; (7) measures of
addiction; (8) sociopathic behaviors; (9) social behaviors;
(10) measures of work or school achievement; (11) measures of
vocational or personal development; and (12) physiological measures
of stress. The table below contains the outcome measures that
were grouped within two outcome types: life indicators of
adjustment and social behaviors:

Outcome labels grouped into two outcome types

    Life indicators of adjustment    Social behaviors

    Number of times hospitalized     Interpersonal maturity
    Length of hospitalizations       Interpersonal interaction
    Time out of hospital             Social relations
    Employment                       Assertiveness
    Discharge from hospital          IPAT sociability scale
    Completion of tour of duty       Acceptance of others
    Recidivism                       FIRO-B
                                     Dating behavior measures
                                     Problem behavior in school [illegible] setting
                                     Social effectiveness
                                     Social distress
                                     Sociometric status
                                     Social distance scale
                                     Social adjustment

Reactivity of Outcome Measure. Highly reactive instruments
are those that reveal or closely parallel the obvious goals or 
valued outcomes of the therapist or experimenter; wh^ch are under 
control of the therapist, who has an acknowledged interest in 
achieving predetermined goals;, or which are subject to t^he ' 
client's need and abilit/ to alter his scores to show more or 
less change than what actually took place. Relatively nonreactive 
measures are not so easily influenced in any directibn by ariy 
of the parties involved. Using this definition of reactivity, 
it was possible to define a five-point scale with the low end 
anchored at unreactive measures, such as physiological measures 
of stress (e.g.. Palmar Sweat Index) ar^d anchored at the high 
end with therapist judgments of client improvement. Points on the 
scale are further illustrated in the following table: 




Conventions for assigning values of reactivity to tests and ratings

Reactivity
value         Tests and ratings of therapy outcome
------------  --------------------------------------------------------
1 (lowest)    Physiological measures (PSI, pulse, GSR), grade point
              average
2             Blinded ratings and decisions: blind projective test
              ratings, blind ratings of symptoms, blind discharge
              from hospital
3             Standardized measures of traits having minimal
              connection with treatment or therapist (MMPI,
              Rotter I-E)
4             Experimenter-constructed inventories (nonblind), rating
              of symptoms (nonblind), any client self-report to
              experimenter, blind administration of Behavioral
              Approach Tests
5 (highest)   Therapist rating of improvement or symptoms, projective
              tests (nonblind), behavior in the presence of therapist
              or nonblind evaluator (e.g., Behavioral Approach Test),
              instruments that have a direct and obvious relationship
              with treatment (e.g., where desensitization hierarchy
              items were taken directly from measuring instrument)

Treatment. To determine whether the therapeutic effect
produced in a study was related to the type of treatment used,
a system for categorizing treatments was developed.

1) Psychodynamic therapies were those employing concepts
such as unconscious motivation, transference relationship,
defense mechanisms, structural elements of personality (id, ego,
superego), ego development and analysis.

2) Dynamic-eclectic therapies are based on dynamic personality
theories, but employ a wider range of therapeutic techniques
and interactive concepts than the more orthodox Freudian theory.

3) Adlerian therapy (Adler is referenced by Dreikurs and
others) is based on the never-ending strivings of the
personality to escape from a sense of inferiority. Striving
for superiority alienates people from love, logic, community
life, and social responsibility.

4) Hypnotherapy (Wolberg) is one type of therapy that uses
hypnosis as a tool for increasing relaxation and suggestibility
and weakening ego defenses. As described by Lewis Wolberg,
hypnotherapy is closely related to psychodynamic theory, suggesting
that such neurotic states as anxiety, hysteria, and compulsions
are susceptible to this treatment.

5) Client-centered or nondirective psychotherapy is
associated with Rogers, Truax, Carkhuff, Gendlin, and Axline
(nondirective play therapy with children) among others. The
key concepts of this therapy include the necessary conditions of
therapist congruence, empathy, and unconditional positive regard
for the client.

6) Gestalt therapy was developed by Perls (Perls, Hefferline,
and Goodman) and, like Rogerian therapy, is humanistic and
phenomenological in philosophy. The key concept in this therapy
is awareness. The healthy person can readily bring into awareness
all parts of his personality and apprehend them as an integrated
whole. Therapy is a process of heightening awareness through
immediate here-and-now emotional and physical experiences and
exercises and integrating alienated elements in the person
(e.g., healing the "splits" between body and mind, conscious and
unconscious).

7) Rational-emotive psychotherapy was developed by Ellis and
rests on a cognitive theory of human personality and therapeutic
intervention. The ABC theory holds that human reactions (C) follow
from cognitions, ideas, and beliefs (B) about an event, rather than
from the event itself (A). The beliefs may be either rational
(logical, empirical) or irrational. These irrational beliefs
are common for people in distress and pervasive in our society.
They include the notion that one must be universally loved, or
that failure at a task is utterly catastrophic. The therapist
demonstrates the ABC theory in relation to the client's problem,
convinces the client of the truth of the theory, confronts the
irrational reactions, and teaches the client to confront them
himself. The objective of therapy is to replace the irrational,
self-defeating cognitions with logical and empirically valid
cognitions.

8) Other cognitive therapies comprise a family of
therapeutic theories related to Ellis's rational-emotive
psychotherapy in that the place of cognitive process (faulty
beliefs, irrational ideas, logically inconsistent concepts) is
central. Theorists in this family include George Kelly, Victor Raimy,
and Donald Tosi. They are similar in that the therapies are
often active, didactic, directive, sometimes bordering on being
hortatory. The therapists confront logical inconsistencies,
interpret faulty generalizations and self-defeating behaviors,
assign tasks to work on, and generally use suggestion and
persuasion to get the client to give up his self-defeating belief
system.

9) Transactional analysis is primarily associated with
Eric Berne, who developed a personality theory based on three ego
states (the parent, adult, and child) and the interrelationship
of these ego states within a person and between persons. All
beliefs, cognitions, and behaviors are under the control of
these ego states. Therapy consists of on-going (usually group)
diagnosis and interpretation of the structural elements of
communication and interaction, with the goal of improved reality
testing and complementary transactions.

10) Reality therapy is identified with William Glasser
and is based on the idea that persons who deny reality are
unsuccessful and distressed. Mental illness does not exist,
only misbehavior that is based on the denial of reality. Reality
is achieved by the fulfillment of the basic needs: to love and
be loved and to feel self-worth (success identity). The
therapist establishes a personal relationship with the client;
attends to present behavior rather than historical events or
feelings; interprets behavior in light of the theory; encourages
the formation of value judgments about correct behavior and a
plan for changing behavior, rejecting excuses for a failure to
change, and the development of self-discipline.

11) Systematic desensitization is a therapy based on
scientific behaviorism, primarily associated with Wolpe. In
this therapy, anxieties are eliminated by the contiguous pairing
of an aversive stimulus with a strong anxiety-competing or
anxiety-antagonistic response. The usual procedure is to teach
the client deep muscle relaxation (a response antagonistic to
anxiety) and then introduce anxiety-provoking stimuli, arranged
in hierarchies, in connection with the relaxation until the
client can confront and overcome the anxiety directly. The
behavioral principles involved are reciprocal inhibition,
counter-conditioning, or extinction.

12) Implosive therapy, developed by Stampfl, operates on
many problems similar to those addressed by systematic
desensitization, and is based on classical conditioning models. The
therapist directs the client's imagery so that he is forced to
imagine the worst possible manifestation of his fear, and the
connection between conditioned stimulus and conditioned response
is extinguished.




13) Operant-respondent behavior therapies are a family of
treatment programs in which the scientific laws of learning are
invoked. The client is viewed as a passive recipient of
reinforcement or conditioning. Proponents include Skinner, Staats,
Bijou, and Baer.

14) Cognitive behavior therapies are a family of therapies
in which laws of learning are applied to cognitive processes.
Unlike the strictly operant or respondent theories, in
cognitive-behavioral therapies the client is more of an active agent
in his own therapy, occasionally even administering the treatment
himself (e.g., self-control desensitization). Modeling treatments
are included in this family of therapies because the client must
identify with the model and adopt the behavior for which the model
(but not the client) is reinforced. Among the proponents of
cognitive behaviorism are Donald Meichenbaum, Albert Bandura,
and Mahoney.

15) Eclectic-behavioral therapy is a collection of treatments
that employ behavioral principles in training programs designed
to affect a variety of emotional and behavioral variables.
Assertiveness training is the principal therapy, and Lazarus and
Phillips are among the proponents.

16) Vocational-personal development counseling involves
providing skills and knowledge to clients to facilitate adaptive
development. Frequently, a trait and factor approach is used with
aptitude and personality testing, diagnosis, prescription, and
interaction with the client to facilitate the development of
personal, social, educational, and vocational skills. Among
the proponents are Theo Volsky and Williamson.

17) "Undifferentiated counseling" refers to therapy or
counseling that lacks descriptive information and references that
would identify it with proponents of theory. It is usually
practiced in schools (i.e., the clients were given ordinary
counseling), but sometimes is used as a foil against which a
more highly valued therapy can be compared. That it cannot be
attributed to any single theorist or group of writers is indicative
of its lack of theoretical explication.

18) Placebo treatments were often included in an experimental
study of therapeutic effectiveness. Placebos were used to test
the effects of client expectancies, therapist attention, and other
nonspecific and informal therapeutic effects. The placebo
treatments tested in the meta-analysis were the following:
relaxation training, attention control, relaxation and suggestion,
relaxation and visualization of scenes in an anxiety hierarchy,
group discussion, reading and discussing a play, informational
meetings, pseudo-desensitization placebo, written information about
the phobic object, bibliotherapy, high expectancy placebo,
visualization of reinforcing scenes, minimal contact counseling,
T-scope therapy, pseudo-treatment control, and lectures.



A scale was developed to indicate the degree of confidence in
classifying therapy labels into therapy types. The greater the
number of concepts, descriptions, and proponents named by the
experimenter and associated with a major school of thought,
the higher the value assigned to this scale. The highest value
(5) was given to a study when the major proponent of a theory
actually participated in the study, or when the therapy sessions
were recorded and rated for their fit with the theory. The low
point of the scale (1) was given to studies when the experimenter
provided almost no key concepts or references. On this five-point
scale, 15 percent of the studies fell into the highest category,
42 percent in the next highest, 24 percent in the middle category,
and 19 percent in the lowest two categories. The mean for the
confidence of classification scale was 3.5 (standard deviation = 1.0).
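
The reported distribution can be checked against the reported mean with
a little arithmetic. The sketch below is only a plausibility check under
a stated assumption: because the two lowest categories are reported as a
single pooled share, the mean can only be bracketed between the case in
which every pooled study was rated 1 and the case in which every pooled
study was rated 2. The variable names are ours, not part of the coding
form.

```python
# Reported shares of studies by confidence-of-classification value.
shares = {5: 0.15, 4: 0.42, 3: 0.24}
# Values 1 and 2 are reported only as a pooled share of 19 percent.
lowest_two_share = 0.19

# Contribution to the mean from the three categories reported separately.
known_part = sum(value * share for value, share in shares.items())

# Bracket the mean: all pooled studies rated 1 (lower) or rated 2 (upper).
mean_lower = known_part + 1 * lowest_two_share
mean_upper = known_part + 2 * lowest_two_share

print(f"mean lies between {mean_lower:.2f} and {mean_upper:.2f}")
```

The reported mean of 3.5 falls inside this bracket, which suggests that
most of the pooled 19 percent were rated 2 rather than 1.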

We have presented so much detail about the psychotherapy study
characteristics and the conventions for coding because we can imagine
that many of the items, particularly those dealing with experimental
methods, are of general usefulness. This chapter concludes with an
example of a study coded according to the conventions described above
and the items on the psychotherapy study coding form in Appendix A.
The study used as an example was performed by Krumboltz and Thoresen
(1964) and is reproduced in Appendix B. Its description appears as
Table 4.4.

Table 4.4

Classification of a study by Krumboltz and Thoresen (1964)

Publication date            1964
Publication form            Journal
Training of experimenter    Education (known by institutional
                            affiliation)
Blinding                    Experimenter (evaluators) did not do
                            therapy, but did know group composition
                            (no information about blinding of
                            evaluators was given)
Diagnosis                   Vocationally undecided (students who asked
                            for counseling about future plans, grouped
                            in "neurotic" diagnostic [category])
Hospitalization             None
Intelligence                Average (estimated, in the absence of
                            other information)
Client-therapist            Moderately similar (ages differed, but
  similarity                socioeconomic status of community
                            indicated similarity)
Age                         16 (high school juniors)
Percentage male             50% (sample stratified by client sex)
Solicitation of clients     Clients volunteered after being given
                            notice that counseling would be available
Assignment of client        Random (stated)
Assignment of therapist     Random
Experimental mortality      No subjects lost from any group (stated)
Internal validity           High
Simultaneous comparison     Yes (2 treatment groups and placebo group
                            compared against control)
Type of treatment           (1) Model reinforcement: Cognitive
                            behavioral subclass (students were shown
                            tapes of models being reinforced for
                            information-seeking behavior, but students
                            were not reinforced personally)
                            (2) Verbal reinforcement: Behavioral
                            subclass (counselors verbally reinforced
                            clients for production of
                            information-seeking statements)
                            (3) Film discussion: Placebo (clients saw
                            and discussed a film, to control for
                            nonspecific effects of counselor
                            attention)
Confidence of               Rated 5 (highest) (because of thoroughness
  classification            of description, knowledge of
                            experimenters' theory and previous work)
Allegiance                  Equal allegiance paid to each of
                            treatments. No allegiance to placebo
                            condition.
Modality                    Mixed (students were randomly assigned to
                            individual and group treatments, but
                            modality did not interact with outcome, so
                            the two modes were combined for the
                            meta-analysis)
Location                    School (stated)
Duration                    2 hours, 2 weeks (2 sessions; time
                            estimated)
Experience of therapists    2 years (estimated by status in
                            counselor-training program plus training
                            for this experiment)
Outcome                     Two outcome measures were used: frequency
                            and variety of information-seeking
                            behavior as estimated from responses to
                            structured interview questions. Reactivity
                            was rated "4" for both, because measures
                            were self-report of clients to nonblind
                            evaluators. These were classified as
                            measures of vocational or personal
                            development.
Effect size                 Statistics reported as treatment means and
                            mean squares from a 4-factor analysis of
                            variance. The effect sizes were as
                            follows:

                            Frequency (of         Variety (of
                            information-seeking   information-seeking
                            behavior)             behavior)
Model reinforcement         1.29                  0.77
Verbal reinforcement        1.05                  1.39
Placebo                     0.[illegible]         0.27


CHAPTER FIVE

MEASURING STUDY FINDINGS

All quantitative, empirical studies aim to assess a particular
phenomenon. In the case of experiments, that phenomenon is an effect of
an independent variable on a dependent variable and it is measured by a
difference between means, perhaps more than one such difference from a single
experiment. In the case of correlational studies, the phenomenon of principal
interest is the relationship between two variables, its strength and direction,
usually expressed on a scale derivative of Pearson's notion of
product-moments. In surveys, attention often focuses on a simple rate or incidence
figure, e.g., 37 percent of people live in multiple-family dwellings. In
a meta-analysis, it is the findings of studies that correspond to the
dependent variable. They are to be measured in quantitative and comparable
terms, then described and accounted for by reference to the "independent"
and "mediating" variables that are the study characteristics discussed in
Chapter Four.

In this chapter, we shall first consider the crudest level of
quantification of study findings, a level that is typical of recent techniques
of research study integration. At this first level, studies are classified
only as "statistically significant" or "nonsignificant." This primitive
translation of complex findings into crude categories proves to have some
unexpected drawbacks; and in modified forms, it may yet prove to have some
advantages in a few special instances. Then we shall discuss at length the
properties and uses of the standardized mean difference for describing
experimental effects. A special aspect of this problem that will be
addressed is the measurement of experimental effects for dichotomously
measured outcome variables. A brief section will be devoted to the
measurement of findings in correlational studies. The chapter concludes
with a description of a measure of effect size recently proposed by
Kraemer and Andrews.

Vote-Counting and Other Crude Measures of Study Findings

The most commonly used method of integrating research studies is
what Light and Smith (1971) referred to as the voting method. There
exists a huge number of such reviews, and no purpose would be
served by citing examples here. Light and Smith characterized the voting
method in these words:

All studies which have data on a dependent variable and a
specific independent variable of interest are examined. Three
possible outcomes are defined. The relationship between the
independent variable and the dependent variable is either
significantly positive, significantly negative, or there is no
significant relationship in either direction. The number of
studies falling into each of these three categories is then
simply tallied. If a plurality of studies falls into any one
of these three categories, with fewer falling into the other two,
the modal category is declared the winner. This modal categorization
is then assumed to give the best estimate of the direction of the
true relationship between the independent and dependent variable.
(p. 443)

Light and Smith pointed out that the voting method of
study integration disregards sample size. Large samples produce
more "statistically significant" findings than small samples.
Suppose that nine small-sample studies yield not quite significant
results, and the tenth large-sample result is significant. The
vote is one "for" and nine "against," a conclusion quite at odds
with one's best instincts. So much the worse for the voting method.
Precisely what weight to assign to each study in an aggregation
is an extremely complex question, one that is not answered
adequately by suggestions to pool the raw data (which are rarely
available) or to give each study equal weight, regardless of
sample size. If one is aggregating arithmetic means, a weighting
of results from each study according to $\sqrt{n}$ might make sense,
reasoning from an admittedly weak analogy between integrating
study findings and combining independent random samples from a
population. The problems of proper integration of statistical
findings are not simply problems of sample size; if pursued for
long, they lead back to the ambiguities of the concept of a "study."
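
To make the weighting alternatives concrete, the following minimal
sketch compares three ways of aggregating study means: equal weights,
weights proportional to the square root of sample size, and weights
proportional to sample size (which is what pooling the raw data amounts
to for means). The study means and sample sizes are hypothetical values
chosen for illustration, and the function name is ours.

```python
import math

def weighted_mean(values, weights):
    """Return the weighted arithmetic mean of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical study means and sample sizes (illustration only).
study_means = [0.20, 0.35, 0.50]
sample_sizes = [25, 100, 400]

# Every study counts the same, regardless of size.
equal = weighted_mean(study_means, [1.0] * len(study_means))
# Weights proportional to the square root of each study's sample size.
by_sqrt_n = weighted_mean(study_means, [math.sqrt(n) for n in sample_sizes])
# Weights proportional to sample size, as if the raw data were pooled.
by_n = weighted_mean(study_means, sample_sizes)

print(f"equal: {equal:.3f}  sqrt(n): {by_sqrt_n:.3f}  n: {by_n:.3f}")
```

In this made-up case the largest study also has the largest mean, so the
aggregate rises as the weighting leans more heavily on sample size. The
choice of weights matters only to the extent that findings and sample
sizes are related.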

Some of the complications of sample size can be avoided post
hoc if the sample size, n, of studies is not systematically related
to the magnitude of the findings of the studies, for example, mean
differences or correlation coefficients. Glass and Smith (1976)
found for over 800 measures of the experimental effect of
psychotherapy versus a control condition that the effect size had a
linear correlation of only .10 with n and essentially no
curvilinear correlation. Smaller size studies tended to show
slightly larger effects, but the relationship was so weak that
it is doubtful that any weighting of findings would make any
difference in the aggregation.

A serious deficiency of the voting method of research
integration is that it discards good descriptive information. To know
that televised instruction beats traditional classroom instruction
in 25 of 30 studies (if, in fact, it does) is not to know
whether TV wins by a nose or in a walkaway. One ought to
integrate measures of the strength of experimental effects or
relationships among variables (according to whether the problem
is basically experimental or correlational). Researchers commonly
believe that significance levels are more informative than they
are. Tallies of statistical significance or insignificance tell
little about the strength or importance of a relationship.

An example will demonstrate that the aggregation of even
simple statistical information can create unexpected difficulties.
There exists a paradox attributed to E. H. Simpson by Colin
Blyth (1972) which has a counterpart in aggregating research
results. Imagine that researcher A is conducting a study of
the effect of amphetamines on hyperactivity in sixth-grade
children. (It is alleged that amphetamines act as depressants on
prepubescent children.) In A's study, 110 hyperactive children
receive the amphetamine, and 70 receive a placebo. After six
weeks' treatment, each child is rated as either 'improved' or
'worse'. The following findings are obtained:



                          Study A

              Amphetamine    Placebo    Total
Improved           50           30        80
Worse              60           40       100
Total             110           70       180

The improvement rate for the amphetamines exceeds that for the placebo:
50/110 = .45 vs. 30/70 = .43.

Suppose researcher B is studying the same problem at a
different site and finds the following results:

                          Study B

              Amphetamine    Placebo    Total
Improved           60           90       150
Worse              30           50        80
Total              90          140       230

Again, the improvement rate for amphetamines is superior to
that for the placebo: .67 vs. .64.

By the voting method of aggregation, the score would be 2-0
in favor of amphetamines. However, an aggregation of the raw
data produces the opposite conclusion:

                   Studies A & B Combined

              Amphetamine    Placebo    Total
Improved          110          120       230
Worse              90           90       180
Total             200          210       410

The improvement rate for placebo now exceeds that for amphetamines:
.55 for amphetamines vs. .57 for placebo.
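
The reversal can be verified directly from the cell counts. The sketch
below recomputes the improvement rates within each study and for the
pooled data; the counts are transcribed from the Study A and Study B
tables, while the data layout and function names are ours.

```python
# (improved, total) for each arm, transcribed from the two study tables.
studies = {
    "A": {"amphetamine": (50, 110), "placebo": (30, 70)},
    "B": {"amphetamine": (60, 90), "placebo": (90, 140)},
}

def rate(improved, total):
    """Proportion of children rated 'improved'."""
    return improved / total

def pooled_counts(arm):
    """Sum the improved and total counts for one arm across studies."""
    improved = sum(arms[arm][0] for arms in studies.values())
    total = sum(arms[arm][1] for arms in studies.values())
    return improved, total

# Improvement rates within each study.
per_study = {
    name: {arm: rate(*counts) for arm, counts in arms.items()}
    for name, arms in studies.items()
}
# Improvement rates after adding the raw counts together.
pooled_rates = {
    arm: rate(*pooled_counts(arm)) for arm in ("amphetamine", "placebo")
}

for name, rates in per_study.items():
    print(name, {arm: round(r, 2) for arm, r in rates.items()})
print("pooled", {arm: round(r, 2) for arm, r in pooled_rates.items()})
```

Within each study the amphetamine rate is the higher one, yet the pooled
rates point the other way. The reversal arises because the two studies
split their subjects between the arms in very different proportions.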

Which method of aggregation is correct? Obviously they cannot
both be correct, since they lead to contradictory conclusions. In
pondering this paradox and its implications for research integration,
it is helpful to note that (1) the paradox has nothing whatever to
do with statistical significance, (2) the sizes of the differences
in rates could be made as large or small as one wished by juggling
the figures, (3) the basic problem is related to the problems of
unbalanced experimental designs (Simpson's paradox could not
occur if amphetamine and placebo groups were of equal size
within each study), and (4) the practical consequences of the
paradox are not negligible; it occurred, for example, in
a study of sex bias in graduate school admissions (see Bickel,
Hammel, & O'Connell, 1975; Gardner, 1976).

Hedges and Olkin (1980) discovered some intriguing and unexpected
deficiencies in the vote-counting method of integrating studies. They
assumed that J studies, each with sample size n, are performed. In each
study, the same effect size $\Delta = (\mu_1 - \mu_2)/\sigma$ is estimated.
The findings of each study are evaluated by a two-tailed t-test
of mean differences at the .05 level of significance. Each result is
classified into one of three categories: negative significant, positive
significant, or statistically insignificant. The decision rule is that
the over-all result is regarded as supporting the hypothesis (that $\mu_1$
is greater than $\mu_2$) if a plurality (i.e., greater than one-third) of
the studies fall into the "positive significant" category.

TABLE 5.1

Probability that a Standard Vote Count Fails to Detect an
Effect for Various Sample, Effect and Cluster Sizes. Each
of the J replicated studies has a common sample size n.
A two-tailed t-test is used to test mean differences at the
.05 level of significance. An effect is detected if the
proportion of positive significant results exceeds one-third.

[Table body: probabilities tabulated by number of studies J, sample
size n per study, and effect size $\Delta$. The entries are not
recoverable from the scanned source.]

Hedges and Olkin assumed normally distributed variables and then
calculated the probabilities, for various sample sizes and numbers of
studies, J, that more than a third of the studies would fall in the
"positive significant" category. In Table 5.1 appear the one's
complements of these probabilities; thus, the tabulated probability is
the probability of failing to detect an effect of a given
size by the one-third plurality rule. Consider, for example, the case
of 10 studies in each of which $\Delta$ is estimated from n = 50 cases and
the true effect size being estimated equals .40, a fairly large effect.
Hedges and Olkin's table shows that the probability of not deciding
that there is a positive effect using the vote-counting strategy is
.770, i.e., the probability of error is greater than three-quarters! What
is even more remarkable is that for $\Delta$ < .40, the probability of making
the error indicated increases as the number of studies integrated
increases. Clearly, there is much that is unacceptable in research
integration by means of vote-counting.
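
The qualitative behavior of the one-third plurality rule can be explored
with a short calculation. The sketch below is ours and rests on
assumptions that the text does not spell out: each study compares two
groups of equal size, and the chance of a "positive significant" result
is approximated with the normal distribution rather than the noncentral
t distribution. It is therefore not expected to reproduce the tabulated
values exactly; the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

def positive_significant_prob(delta, n_per_group, alpha=0.05):
    """Approximate chance that one study is 'positive significant'.

    Normal approximation to a two-tailed, two-sample test with equal
    group sizes (an assumption made for this sketch).
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the true effect size.
    shift = delta * math.sqrt(n_per_group / 2)
    return 1 - normal.cdf(z_crit - shift)

def vote_count_miss_prob(delta, n_per_group, n_studies):
    """Probability that the one-third plurality rule misses the effect."""
    p = positive_significant_prob(delta, n_per_group)
    # Detection requires the count to strictly exceed n_studies / 3.
    max_votes_without_detection = n_studies // 3
    return sum(
        math.comb(n_studies, k) * p**k * (1 - p) ** (n_studies - k)
        for k in range(max_votes_without_detection + 1)
    )

for j in (10, 20, 50):
    print(j, round(vote_count_miss_prob(0.2, 25, j), 4))
```

Whenever the chance of a positive significant result in a single study
is below one-third, adding studies pushes the proportion of such results
toward that chance and away from the one-third threshold, so the
probability of missing a real effect grows with the number of studies.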

Integrating Significance Tests

Some researchers have set forward as the principal problem of
research integration the combining of significance levels into a joint
test of a null hypothesis. Gage (1976) contributed a considered and
illuminating paper on integrating studies on teaching. Following an
astute critique of the voting method, he posed the aggregation problem
as a problem in determining whether several individual studies, many
of which showed no significant correlation, constituted in the aggregate
sufficient evidence to reject the null hypothesis at a high level of
significance. He employed the chi-square method of K. Pearson (1933)
and E. S. Pearson (1938) via Jones and Fiske (1953). If k independent
studies yield significance levels $p_1, p_2, \ldots, p_k$, then under the
common null hypothesis tested in each study:

$$-2 \sum_{i=1}^{k} \log_e p_i \sim \chi^2_{2k}.$$

This approach seems defensible and more powerful than a binomial
test (testing whether the probability of "positive" findings is different
from .5) where statistical hypothesis testing is a genuine concern.
For most problems of meta-analysis, however, the number of studies will
be so large and will encompass so many hundreds of subjects that null
hypotheses will be rejected routinely. Perhaps it is more realistic
to think of the typical meta-analysis problem as residing in that vicinity
the statistician calls "the limit," where all null hypotheses are false
and inferential questions disappear. The statistical integration of
studies probably ought to fulfill descriptive purposes more than inferential
ones, though obviously it may fulfill both.
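
The chi-square combination of significance levels is simple enough to
sketch in a few lines. In the version below, which is our illustration
rather than Gage's computation, minus twice the sum of the natural
logarithms of k independent p-values is referred to a chi-square
distribution with 2k degrees of freedom. Because the degrees of freedom
are always even, the upper-tail probability has a closed form and no
statistical library is needed. The function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate with even df."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    # exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def combine_p_values(p_values):
    """Return (statistic, df, combined p) for independent p-values."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * len(p_values)
    return statistic, df, chi2_sf_even_df(statistic, df)

# Three hypothetical studies, none significant at the .05 level alone.
statistic, df, combined = combine_p_values([0.10, 0.10, 0.10])
print(f"chi-square = {statistic:.2f} on {df} df, combined p = {combined:.4f}")
```

With these made-up inputs, three individually nonsignificant results
combine to a p-value a little above .03, which illustrates the situation
Gage posed. The calculation presumes the studies are independent.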

If the Pearson $\chi^2$ test of combined results begins to play an
increasingly important role in research integration, methodologists will need
to scrutinize its assumptions and properties. It is probably quite
sensitive to nonindependence of studies (cf. Jones & Fiske, 1953, pp.
377-381). Furthermore, the extreme tails of distributions are exotic
places about which more would have to be learned. For example, violation
of normality assumptions has little effect on 95th and 99th percentiles
of t and F distributions, but conceivably it can change a p of .001,
under normality, to a p of .0001, which is a disturbance in natural
logarithms from -6.91 to -9.21.

Rosenthal (1978) recently evaluated nine different methods that
have been used at one time or another to aggregate statistical
significance measures from many studies. These methods include addition of
logs of p-levels mentioned above as well as adding probabilities
(Edgington, 1972a), adding t's (Winer, 1971), Stouffer's method of
adding Z's (Mosteller and Bush, 1954), adding weighted Z's (Mosteller
and Bush, 1954), testing the average p-level (Edgington, 1972b), testing
the average Z (Mosteller and Bush, 1954), counting (vote-method), and
blocking (see Rosenthal, 1978, p. 190). Rosenthal's summary of the
advantages and limitations of the various methods appears as Table 5.2.
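
As an illustration of one of these methods, the sketch below implements
the unweighted "adding Z's" procedure as it is commonly described: each
one-tailed p-value is converted to a standard normal deviate, the
deviates are summed, and the sum is divided by the square root of the
number of studies. This is our minimal rendering under the assumption of
independent studies with p-values all stated for the same direction; it
does not reproduce any computation from Rosenthal's paper, and the
function name is hypothetical.

```python
import math
from statistics import NormalDist

def stouffer_combined_z(one_tailed_p_values):
    """Combine independent one-tailed p-values by adding normal deviates."""
    normal = NormalDist()
    # Convert each p-value to the standard normal deviate it cuts off.
    z_scores = [normal.inv_cdf(1.0 - p) for p in one_tailed_p_values]
    # The sum of k unit-variance deviates has variance k under the null.
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    return z_combined, 1.0 - normal.cdf(z_combined)

# Four hypothetical studies, each with a one-tailed p of .10.
z, p = stouffer_combined_z([0.10, 0.10, 0.10, 0.10])
print(f"combined Z = {z:.3f}, one-tailed p = {p:.4f}")
```

A weighted variant would multiply each deviate by a study weight and
divide by the square root of the sum of squared weights; that extension
is omitted here to keep the sketch short.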

Table 5.2*

Advantages and Limitations of Nine Methods of Combining Probabilities

Method: Adding logs
  Advantages:      Well established
  Limitations:     Cumulates poorly; can support opposite conclusions
  Applicable when: N of studies is small (<5)

Method: Adding p's
  Advantages:      Good power
  Limitations:     Inapplicable when N of studies (or p's) is large,
                   unless complex corrections are introduced
  Applicable when: N of studies is small (sum of p's < 1.0)

Method: Adding t's
  Advantages:      Unaffected by N of studies, given minimum df per study
  Limitations:     Inapplicable when t's are based on very few df
  Applicable when: Studies are not based on too few df

Method: Adding Z's
  Advantages:      Routinely applicable, simple
  Limitations:     Assumes unit variance when under some conditions
                   Type I or Type II errors may be increased
  Applicable when: Anytime

Method: Adding weighted Z's
  Advantages:      Routinely applicable, permits weighting
  Limitations:     Assumes unit variance when under some conditions
                   Type I or Type II errors may be increased
  Applicable when: Whenever weighting is desired

Method: Testing mean p
  Advantages:      Simple
  Limitations:     N of studies should not be less than four
  Applicable when: N of studies ≥ 4

Method: Testing mean Z
  Advantages:      No assumption of unit variance
  Limitations:     Low power when N of studies is small
  Applicable when: N of studies ≥ 5

Method: Counting
  Advantages:      Simple and robust
  Limitations:     Large N of studies is needed; may be low in power
  Applicable when: N of studies is large

Method: Blocking
  Advantages:      Displays all means for inspection, thus facilitating
                   search for moderator variables
  Limitations:     Laborious when N is large; insufficient data may
                   be available
  Applicable when: N of studies is not too large

*After Rosenthal (1978), reprinted by permission of the author
and publisher.
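Two of the methods in Table 5.2, adding logs and adding Z's, can be sketched in a few lines. This is only an illustration under stated assumptions: the p-levels are hypothetical one-tailed values, the function names are ours, and the adding-logs statistic is referred to a chi-square distribution with 2k degrees of freedom, whose upper tail has a closed form for even degrees of freedom.

```python
import math
from statistics import NormalDist


def add_logs(p_values):
    """Adding logs: -2 * sum(ln p) referred to chi-square with 2k df."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # chi-square statistic / 2
    # Closed-form upper tail of a chi-square variate with even df = 2k.
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))


def add_zs(p_values):
    """Adding Z's (Stouffer): sum of unit normal deviates over sqrt(k)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - nd.cdf(z)


p_values = [0.05, 0.05]  # hypothetical one-tailed p-levels from two studies
combined_logs = add_logs(p_values)
combined_zs = add_zs(p_values)
print(round(combined_logs, 4), round(combined_zs, 4))
```

With these two hypothetical studies the methods give somewhat different combined p-levels, which is in keeping with the differing properties summarized in the table.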
Scaling Experimental Findings

For several reasons and in several ways it may occur that the findings
of a comparative study exist only in the form of a report of whether one
mean (median or whatever) is higher or lower than another. This most basic
report of a finding can arise from 1) very rudimentary reporting in a brief
article, 2) the desire to avoid making dubious assumptions, or 3) incomplete
data which preclude the calculation of a metric measure of effect or
correlation. Thus, a data analyst attempting to integrate the findings of
many studies may have in hand data of the following type: in 75 comparisons
of treatments A and B, A exceeded B 45 times on the outcome measure, and B
exceeded A the other 30 times. The key to converting these rudimentary
results into metric measures of effects or correlation lies in traditional
methods of psychometric scaling. In particular, if one can assume normality,
then Thurstone's "law of comparative judgment" can be applied directly and
the proportion of times A exceeds B can be translated directly into a
measure of standardized mean difference between A and B (see Torgerson,
1958, p. 159ff).
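As a minimal sketch of the conversion just described, the proportion of comparisons favoring A can be turned into a unit normal deviate. The function name is ours, and the sketch assumes each comparison behaves like one judgment under Thurstone's model; the unit of the resulting scale is arbitrary (under Case V a factor of the square root of 2 is sometimes applied), so only the deviate itself is computed.

```python
from statistics import NormalDist


def unit_normal_deviate(times_a_exceeds_b, n_comparisons):
    """Unit normal deviate below which the observed proportion lies."""
    proportion = times_a_exceeds_b / n_comparisons
    return NormalDist().inv_cdf(proportion)


# The example from the text: A exceeded B in 45 of 75 comparisons.
z_ab = unit_normal_deviate(45, 75)
print(round(z_ab, 2))
```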

We have applied this procedure in connection with a meta-analysis
of research on the relationship of class-size to achievement (Glass and
Smith, 1979).

Only the post-1960 studies were included in the scaling analysis.
The regression analyses show that studies done prior to 1960 showed little
relationship between class-size and achievement (probably because of poor
design, poor measures, and because genuinely small classes, less than a
dozen pupils, say, were seldom studied). The post-1960 studies produced
246 values of $\Delta$ for which one needs only to note whether $\Delta$ is
positive or negative. In addition, there were a small number of studies that
yielded only comparisons of the sizes of the achievement means for the small
and large classes, but no metric information from which $\Delta$ might be
calculated. The principal study of this type was Furno and Collins (1967).
The findings from these studies could be included in the scaling analyses
even though they could not be included in the regression analyses. The total
number of paired comparisons was 559.

The class-size dimension was broken into five categories in an
attempt to obtain an even distribution of comparisons. These categories were
as follows: 1-11 pupils, 12-22, 23-32, 33-42, 43 or more pupils. The actual
average class-sizes falling into these categories were as follows: 2, 18,
28, 38, and 84 pupils. These averages will be used to represent the
categories. Thus, a comparison of achievement means for classes of sizes 4
and 30, for example, will be spoken of as a comparison of classes of size
2 and 28.

The following frequency matrix was obtained by counting direction
of superiority in the paired comparisons:

                    Paired Comparison Frequency Matrix

                                 Class Size
Class Size       2            18            28            38           84

     2           --         7 of 8       45 of 46       3 of 3         --
    18         1 of 8         --        111 of 160    124 of 157     2 of 3
    28         1 of 46     49 of 160        --        109 of 167     5 of 9
    38         0 of 3      33 of 157     58 of 167        --         1 of 6
    84           --         1 of 3        4 of 9        5 of 6         --

This matrix is read as follows: each entry represents the number of times
the row class-size had a higher achievement mean than the column class-size.
For example, there were 46 comparisons of class-size 2 and class-size 28; in
45 of them, achievement was superior in the class of 2.

It was decided at this point that some comparisons were so infrequently
represented that including them in the scaling analysis might greatly
overweight their unstable estimates. It was decided arbitrarily to include
only those cells with more than a half-dozen comparisons. Thus, the following
three cells (three on each side of the diagonal) were eliminated: row 1 -
column 4; row 2 - column 5; row 4 - column 5. The resulting frequency matrix
is then transformed to a proportions matrix, π, e.g., 111 of 160 = .69, and
then to an X-matrix where X_ij is the unit normal deviate below which lies
π_ij proportion of the normal curve. The π and X matrices are combined in
the following figure:

                                 Class Size
Class Size            2         18         28         38         84

     2     π =        --        .88        .98        --         --
           X =        --       1.18       2.05        --         --

    18     π =       .12        --         .69        .79        --
           X =     -1.18        --         .50        .81        --

    28     π =       .02        .31        --         .65        .56
           X =     -2.05       -.50        --         .39        .15

    38     π =        --        .21        .35        --         --
           X =        --       -.81       -.39        --         --

    84     π =        --        --         .44        --         --
           X =        --        --        -.15        --         --

The solution for scale values follows Gulliksen's (1956) least-squares
solution for incomplete data. A vector Z is formed by summing the columns of X:
Z' = (3.23, 0.57, -2.01, -1.20, -0.15). A matrix M of order 5x5 is formed
such that a -1 is entered in each off-diagonal cell in X that is not empty, a
zero is entered for each empty cell, and the diagonal entry is the number of
non-empty cells in the corresponding column of X. The last scale value,
corresponding to class-size 84, is arbitrarily set equal to zero, and the last
row and column of M are deleted. The reduced matrices, M and Z, are combined
to form the normal equations of the least-squares solution for the scale values:

$$M S = Z$$

The estimates and their solution are as follows:

$$
\begin{bmatrix}
 2 & -1 & -1 &  0 \\
-1 &  3 & -1 & -1 \\
-1 & -1 &  4 & -1 \\
 0 & -1 & -1 &  2
\end{bmatrix}
\begin{bmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \end{bmatrix}
=
\begin{bmatrix} 3.23 \\ 0.57 \\ -2.01 \\ -1.20 \end{bmatrix}
$$

$$
\begin{bmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \end{bmatrix}
=
\begin{bmatrix}
1.625 & 1.250 & 1.000 & 1.125 \\
1.250 & 1.500 & 1.000 & 1.250 \\
1.000 & 1.000 & 1.000 & 1.000 \\
1.125 & 1.250 & 1.000 & 1.625
\end{bmatrix}
\begin{bmatrix} 3.23 \\ 0.57 \\ -2.01 \\ -1.20 \end{bmatrix}
$$

S' = (2.60, 1.38, 0.59, 0.39, 0)
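The normal equations can be checked numerically. The sketch below takes the reduced matrix M and the vector Z exactly as printed above and solves M S = Z with a small Gauss-Jordan routine; the routine and its names are ours and are not part of Gulliksen's procedure.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry in this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Reduced M and Z as printed in the text (class-size 84 fixed at zero).
M = [[2, -1, -1, 0],
     [-1, 3, -1, -1],
     [-1, -1, 4, -1],
     [0, -1, -1, 2]]
Z = [3.23, 0.57, -2.01, -1.20]

scale_values = solve(M, Z) + [0.0]  # append the fixed value for class-size 84
print([round(s, 2) for s in scale_values])
```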

The graph of the scaled relationship between class-size and achievement
appears as Figure 9. The scale values on the ordinate of the graph are
arbitrary. The quadratic equation which best fits the five points by the
least-squares criterion is as follows:

$$s = 2.78912 - 0.09318(\text{Size}) + 0.000715(\text{Size})^2$$

The multiple R-squared is 0.99. The following estimates of achievement (on
an arbitrary scale) for various class-sizes were obtained from the regression
curve:

Class     Estimated Scale Value     Decrease in Achievement
Size      for Achievement           From 10 More Pupils

  1            2.70                       .86
 10            1.93                       .72
 20            1.21                       .57
 30            0.64                       .43
 40            0.21                       .29
 50           -0.08                       .15
 60           -0.23                        0
 70           -0.23
 80           -0.09

The curve in Figure 9 shows the expected and quite plausible decelerating
decrease in achievement as class-size increases.
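The tabled estimates can be regenerated from the fitted quadratic. The sketch below assumes a positive coefficient on the squared term, which is the sign that reproduces the tabled scale values; the function name is ours.

```python
def scale_value(size):
    """Estimated achievement (arbitrary units) from the fitted quadratic."""
    return 2.78912 - 0.09318 * size + 0.000715 * size ** 2


sizes = [1, 10, 20, 30, 40, 50, 60, 70, 80]
estimates = {size: round(scale_value(size), 2) for size in sizes}
for size in sizes:
    print(size, estimates[size])
```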
Figure 5.1. Relationship between class-size and achievement (arbitrary units)
obtained by psychometric scaling of comparisons
FINDINGS OF EXPERIMENTAL STUDIES

The description of findings in experimental studies so that results
can be aggregated and their variability studied presents several technical
problems. The findings of comparative experiments are probably best
expressed as standardized mean differences between pairs of treatment
conditions. It will seldom be satisfactory to express experimental
findings as a measure of association between several levels of an
independent variable and a metric dependent variable. Such association
measures (e.g., $\omega^2$) are descriptive of a complete, somewhat arbitrary,
set of experimental conditions an investigator chooses to investigate
in a single study. For example, if one wished to determine the
comparative effects of computer-assisted and traditional foreign language
instruction, then it is irrelevant that a televised instruction condition
was also present in a study, and one would not want a quantitative
measure of effect to be influenced by the irrelevant condition (Glass &
Hakstian, 1969).

In what follows, reference will be made to the comparison of a
particular experimental condition with a control group. Of course,
there may be no "control" group in a traditional sense, and one could
imagine that two different experimental conditions are compared. The
most informative and straightforward measure of experimental
effect size is the mean difference divided by within-group standard
deviation:

$$\Delta = \frac{\bar{X}_E - \bar{X}_C}{s_x} \qquad (1)$$
Suppose that four experiments were performed in which either nialomide
or iproniazid was compared with a placebo for efficacy in relieving
depression. Three of the experiments measured outcomes with the MMPI D
scale; the fourth study used the Beck Depression Inventory. Suppose the
following results were obtained. (The data are hypothetical, but the
findings are close to those reported in Smith, Glass and Miller (1980).)

Study No.   Comparison               Test    Means            St. dev.   $\Delta$

1           Nialomide vs. Placebo    MMPI    70.10-70.50       9.50      -.04
2           Nialomide vs. Placebo    MMPI    51.45-52.31      11.25      -.08
3           Iproniazid vs. Placebo   MMPI    60.21-65.15       7.80      -.63
4           Iproniazid vs. Placebo   Beck    110.75-121.45    20.50      -.52

In the above data, the average effect of nialomide is -.06,
i.e., six-hundredths standard deviation superior to a placebo; the average
effect of iproniazid is -.58, more than a half standard deviation.
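The arithmetic behind these figures is simple enough to sketch. The code below uses the hypothetical means and standard deviations from the four studies above; the variable and function names are ours.

```python
# (drug, drug-group mean, placebo mean, standard deviation) for each study
studies = [
    ("nialomide", 70.10, 70.50, 9.50),
    ("nialomide", 51.45, 52.31, 11.25),
    ("iproniazid", 60.21, 65.15, 7.80),
    ("iproniazid", 110.75, 121.45, 20.50),
]


def effect_size(mean_drug, mean_placebo, sd):
    """Mean difference divided by the standard deviation."""
    return (mean_drug - mean_placebo) / sd


by_drug = {}
for drug, mean_drug, mean_placebo, sd in studies:
    by_drug.setdefault(drug, []).append(effect_size(mean_drug, mean_placebo, sd))

averages = {drug: sum(values) / len(values) for drug, values in by_drug.items()}
print({drug: round(avg, 2) for drug, avg in averages.items()})
```

Note that the fourth study uses a different test from the other three; dividing by the standard deviation is what allows its result to be averaged with theirs.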

The meaning of $\Delta$ is readily comprehended and, assuming some
distribution form, can be translated into notions of overlapping
distributions of scores and comparable percentiles. For example, suppose
that a study of the effect of ritalin versus placebo on reducing
hyperactivity reveals a $\Delta$ of -1.00. One knows immediately that the
average child on ritalin shows hyperactivity one standard deviation below
that of the average child on placebo; thus, assuming normality, only 16
percent of the placebo children are less hyperactive than the average child
on the drug, and so on.
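The translation from an effect size to a percentile can be sketched as follows, assuming normal distributions with a common standard deviation as in the text; the function name is ours.

```python
from statistics import NormalDist


def share_of_controls_below(delta):
    """Proportion of the control distribution lying below the treated mean."""
    return NormalDist().cdf(delta)


share = share_of_controls_below(-1.00)
print(round(100 * share))  # percent of placebo children below the treated mean
```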



Another way to interpret the magnitude of the effect size is to
compare it to other effect sizes, particularly for treatments whose
strength many people can judge from their own experience. One
TV program that the American public has enthusiastically endorsed is Sesame
Street. Effects of Sesame Street on social behavior, such as cooperation,
were included in a meta-analysis. However, the primary aim of Sesame
Street, particularly the first year, was cognitive skills instruction:
prereading, language, and math. These cognitive outcome measures were not
considered in the meta-analysis, but are considered by many parents and
preschool teachers to be substantial.

In 1970 and again in 1971, the Educational Testing Service (ETS)
conducted a field study evaluation of Sesame Street. Both years had numerous
measurements, several subsamples, several research designs, and confounded
results, making a single numerical summary statement difficult. The most
easily interpreted results compared two groups of to 5 year old disadvan- 
taged children, of which one group had not seen Sesame Street while the 
other had watched for one season. The criterion measure was a special 
test developed by ETS covering the cognitive skills taught on the program. 
The tendency was for those who watched more to gain more, although
viewing differences were confounded with intelligence and other background
variables. The effect sizes for four levels of viewing versus no viewing
all favor the Sesame Street viewers, varying from .53 to 1.45, with a mean
of 1.00.

A more controlled analysis was possible the second year with 283
children randomly assigned to groups who either had or did not have a TV to
view the program. A set of covariance analyses (covarying on pretest score,
pretest Peabody IQ, and SES) resulted in seven effect sizes, varying from
^.9A to .-BA. Dropping the "parts of whole" test that was a low outlier,
the mean effect size was .45 with a standard deviation of .085. The remaining
tests covered the topics of number, sorting, forms, pre-reading, relational
terms, and classification.

Electric Company, the Sesame Street sequel for older children, was
evaluated by ETS in 1973 and 1974. Again, there were numerous analyses,
but using the total score on an ETS reading test as the criterion measure,
for children in grades one to four, in two cities, comparing those who were
encouraged to watch the program at home versus those who were not encouraged,
the average effect size was .17. This effect is low partially because
non-encouraged children also watched the program; thus this effect size
is a measure of increased reading achievement due to increased watching
when encouraged by a teacher to view the program after school.

Both the first and second year evaluations also had an in-school
experimental design component. Two locations with large numbers of
either Spanish speaking or black children were assigned to teachers who
were encouraged to show the program regularly during the year or who
were asked not to. The amount of viewing and supplemental instruction
was teacher determined. Two outcome measures, the ETS reading test and
the Metropolitan Achievement Test, provided similar results. Averaging
the data from two locations, grades one through three, and the two years
resulted in an effect size of .43 (S.D. = .30) for the ETS reading test
and .35 (S.D. =^-3^) for the Metropolitan Achievement Test. The overall
average is .39, with scores ranging from ^.Q3 to 1.02.



Interpretations of effect sizes, $\Delta$, in terms of percentiles
(e.g., if $\Delta$ = +1.00, then the average person in the experimental group
has a score that exceeds 84 percent of the persons' scores in the
control group) depend, of course, on assumptions about the shapes of the
distributions of the variables in the two groups. Normality is a convenient
and unobjectionable assumption in many instances, but its convenience
should not blind one to the fact that it is an assumption that may
occasionally be false. Kraemer and Andrews (1980) have called attention to
this problem. Suppose, for example, that the scores in the experimental and
control groups are distributed according to the exponential distribution
(Hastings and Peacock, 1974, pp. 56-59) with the following parameters:

Group          Distribution                        Mean        St. dev.

Experimental   $P(x_1) = a_1 e^{-a_1 x_1}$         $1/a_1$     $1/a_1$
Control        $P(x_2) = a_2 e^{-a_2 x_2}$         $1/a_2$     $1/a_2$

Now the effect size $\Delta$ equal to

$$\Delta = \frac{\bar{X}_E - \bar{X}_C}{s_C}$$

will estimate, in the case of exponential distributions,

$$\Delta = \frac{1/a_1 - 1/a_2}{1/a_2} = \frac{a_2 - a_1}{a_1} .$$

Suppose that a particular experiment yields summary statistics
as follows:

« 18 , « 16 ; 

* X(. = 10 , *= 10 : 

<) 

The value of $\Delta$ equals (18 - 10) / 8 = +1. If it is assumed that the
two distributions are normal, then the $\Delta$ of +1 has the usual
interpretation; the average person in the experimental group exceeds 84
percent of the persons in the control group. Suppose, however, that the
average experimental group person's score is expressed as a percentile in the
control group, assuming exponential distribution in each group. Then
the percentile rank of X = 18 in an exponential distribution with
parameter $a_2 = .10$ is given by

$$\int_0^{18} P(x)\,dx = \int_0^{18} .10\,e^{-.10x}\,dx = .834$$

Thus, assuming exponential distributions within experimental and control
groups gives essentially the same interpretation of $\Delta$ = +1 as the
assumption of normal distributions (.83 vs. .84). This example is not meant
to suggest that the exponential distribution is in any sense interchangeable
as an assumption with the normal distribution. The assumption of distribution
shapes may be important and it should be checked when possible and the most
reasonable assumption made.
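The two percentile interpretations compared above can be computed side by side. This sketch assumes a control-group mean of 10 (so that the exponential rate parameter is 1/10) and an experimental-group mean of 18, as in the example; the variable names are ours.

```python
import math
from statistics import NormalDist

control_mean = 10.0       # exponential rate parameter is 1 / control_mean
experimental_mean = 18.0

# Percentile rank of the experimental mean in an exponential control group.
exponential_rank = 1.0 - math.exp(-experimental_mean / control_mean)

# Percentile rank implied by an effect size of +1 under normality.
normal_rank = NormalDist().cdf(1.0)

print(round(exponential_rank, 3), round(normal_rank, 3))
```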
The choice of the standard deviation with which to scale the
differences between group means to determine $\Delta$ is crucial. Various
choices can result in substantial differences in effect size.

The definition of $\Delta$ appears uncomplicated, but heterogeneous
group variances cause difficulties. Suppose that experimental and control
groups have means and standard deviations as follows:

                         Experimental          Control

Means                    $\bar{y}_E = 52$      $\bar{y}_C = 50$
Standard Deviations      $S_E = 2$             $S_C = 10$

The measure of experimental effect could be calculated either by use
of $S_E$ or $S_C$ or some combination of the two.

Basis of Standardization        $\Delta$

a) $S_E$                        1.00
b) $S_C$                        0.20
c) $(S_E + S_C)/2$              0.33

The average standard deviation, c), probably should be eliminated
as a mere mindless statistical reaction to a perplexing choice. But
both the remaining 1.00 and 0.20 are correct; neither can be ruled out
as false. It is true, in fact, that the experimental group mean is one
standard deviation above the control group mean in terms of the
experimental group standard deviation; and, assuming normality, the average
subject in the control group is superior to only 16 percent of the members
of the experimental group. However, the control group mean is only
one-fifth standard deviation below the mean of the experimental group
when measured in control group standard deviations; thus, the average
experimental group subject exceeds 58 percent of the subjects in the
control group. These facts are not contradictory; they are two distinct
features of a finding which cannot be expressed by one number. In a
meta-analysis of psychotherapy experiments, the problem of heterogeneous
standard deviations was resolved from a quite different direction.
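The three bases of standardization, and the two percentile statements that follow from the first two, can be sketched as below. The means and standard deviations are those of the example; the names are ours, and normality is assumed for the percentiles as in the text.

```python
from statistics import NormalDist

mean_e, mean_c = 52.0, 50.0
sd_e, sd_c = 2.0, 10.0
difference = mean_e - mean_c

by_basis = {
    "experimental sd": difference / sd_e,
    "control sd": difference / sd_c,
    "average sd": difference / ((sd_e + sd_c) / 2.0),
}

normal = NormalDist()
# Share of the experimental group lying below the average control subject.
control_mean_in_experimental = normal.cdf(-by_basis["experimental sd"])
# Share of the control group lying below the average experimental subject.
experimental_mean_in_control = normal.cdf(by_basis["control sd"])

print({basis: round(value, 2) for basis, value in by_basis.items()})
print(round(control_mean_in_experimental, 2), round(experimental_mean_in_control, 2))
```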

Suppose that methods A, B, and Control are compared in a single
experiment, with the following results:

                         Method A      Method B      Control
Means                       50            50            48
Standard deviations         10             1             4

If effect sizes are calculated using the standard deviations of
the "method," then $\Delta_A$ equals 0.20 and $\Delta_B$ equals 2.00, a
misleading difference, considering the equality of the method means on the
dependent variable. Standardization of mean differences by the control group
standard deviation at least has the advantage of allotting equal effect
sizes to equal means. This seems reason enough to resolve the choice in
favor of the control group standard deviation, at least when there are
more than two treatment conditions and only one control condition.
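The contrast between the two choices in this three-group example can be sketched as follows; the data are those given above and the names are ours.

```python
methods = {"A": (50.0, 10.0), "B": (50.0, 1.0)}  # (mean, standard deviation)
control_mean, control_sd = 48.0, 4.0

# Standardizing each difference by the method's own standard deviation.
by_own_sd = {name: (mean - control_mean) / sd for name, (mean, sd) in methods.items()}

# Standardizing each difference by the control group standard deviation.
by_control_sd = {name: (mean - control_mean) / control_sd for name, (mean, _) in methods.items()}

print(by_own_sd)
print(by_control_sd)
```

Only the second choice gives the two methods, whose means are equal, the same effect size.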

Estimation of $\Delta$

Given that

$$\Delta_{A-B} = \frac{\mu_A - \mu_B}{\sigma_y}$$

and assuming for the moment an understanding of which of many possible
choices of $\sigma_y$ is implied, the intuitively reasonable estimator of
$\Delta$ is

$$\hat{\Delta}_{A-B} = \frac{\bar{X}_A - \bar{X}_B}{s_y} \qquad (4)$$

where the sample means are conventionally defined and $s_y$ is the square
root of the unbiased estimator of $\sigma_y^2$. Hedges (1979) showed the error
of intuition with regard to (4), and he derived the maximum likelihood
estimator of $\Delta$ assuming normality and a single sample estimate of
$\sigma_y$.

Hedges (1979) examined the statistical properties of

$$\hat{\Delta}_{E-C} = \frac{\bar{X}_E - \bar{X}_C}{s_C}$$

as an estimator of $\Delta_{E-C}$. He was able to show that
$\hat{\Delta}_{E-C}\,[n_1 n_2/(n_1 + n_2)]^{1/2}$ is distributed
as a non-central t variate with non-centrality parameter
$\Delta_{E-C}\,[n_1 n_2/(n_1 + n_2)]^{1/2}$ and degrees of freedom equal
to $n_2 - 1$, where $n_1$ and $n_2$ are the sizes of the samples for the
experimental and control groups, respectively. Of course, this finding rests
on the assumption that X is normally distributed for both the experimental and
control groups.
It followed as a consequence of this theorem that the expected value of
$\hat{\Delta}$ is given by

$$E(\hat{\Delta}) = \Delta\,[K(n_2 - 1)]^{-1}, \quad \text{where} \quad
K(n_2 - 1) = \frac{\Gamma\left(\frac{n_2 - 1}{2}\right)}
{\left(\frac{n_2 - 1}{2}\right)^{1/2}\,\Gamma\left(\frac{n_2 - 2}{2}\right)} \qquad (5)$$

Hence, $\hat{\Delta}$ is biased as an estimator of $\Delta$. The degree of
bias is a function of the ratio of two gamma functions as can be seen above.
In Figure 5.2 (from Hedges, 1979), the bias in $\hat{\Delta}$ as an estimator
of $\Delta$ is depicted by graphing the ratio $E(\hat{\Delta})/\Delta$ against
$n_2 - 1$. As can be seen there, $\hat{\Delta}$ is positively biased for small
n; beyond sample size $n_2$ of 20, the bias is 10 percent or less.

Clearly, an unbiased estimate of $\Delta$ could be obtained by multiplying
$\hat{\Delta}$ by the correction factor $K(n_2 - 1)$. Hedges (1979, p. 11)
provided a table of values of $K(n_2 - 1)$ which is reproduced as Table 5.3,
in slightly modified form with his kind permission.
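The correction factor can be computed directly rather than read from a table. The sketch below assumes the gamma-function form K(m) = Γ(m/2) / [(m/2)^{1/2} Γ((m-1)/2)] with m = n2 - 1, which agrees with the tabled values (for example, K(4) = 0.79788); the function names and the two estimates in the usage example are hypothetical.

```python
import math


def k_factor(df):
    """Correction factor K(df), computed through log-gamma for stability."""
    log_ratio = math.lgamma(df / 2.0) - math.lgamma((df - 1) / 2.0)
    return math.exp(log_ratio) / math.sqrt(df / 2.0)


def corrected_average(estimates):
    """Correct each (effect size, control n) pair by K(n - 1), then average."""
    corrected = [k_factor(n_control - 1) * delta for delta, n_control in estimates]
    return sum(corrected) / len(corrected)


# Hypothetical estimates from two experiments with 5 control subjects each.
hypothetical_estimates = [(1.30, 5), (1.10, 5)]
average = corrected_average(hypothetical_estimates)
print(round(k_factor(4), 5), round(average, 3))
```

Correcting each estimate before averaging, rather than correcting the average, matters when the studies differ in control-group size, since each then carries its own factor.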

Hedges (1979) pointed out an unexpected and important property
of effect sizes as estimators. Suppose that one obtains a series of
observations of effect sizes, $\hat{\Delta}_j$, each of which estimates the
same parameter value $\Delta$. Assume further that for J such estimates, an
aggregate estimate is obtained by averaging; thus

$$\Delta \text{ is estimated by } \sum_{j=1}^{J} \hat{\Delta}_j / J .$$
Figure 5.2. Ratio of the expected value of the estimated effect size
$\hat{\Delta}$ to the parameter value $\Delta$ as a function of the control
group sample size, $n_2$.



Table 5.3

Values of K(n2-1) to be used in obtaining unbiased estimates of $\Delta$

n2-1    K(n2-1)        n2-1    K(n2-1)        n2-1    K(n2-1)

  2     0.56419         21     0.96378         40     0.98111
  3     0.72360         22     0.96545         41     0.98158
  4     0.79788         23     0.96697         42     0.98202
  5     0.84075         24     0.96837         43     0.98244
  6     0.86863         25     0.96965         44     0.98284
  7     0.88820         26     0.97083         45     0.98322
  8     0.90270         27     0.97192         46     0.98359
  9     0.91387         28     0.97293         47     0.98394
 10     0.92275         29     0.97387         48     0.98428
 11     0.92996         30     0.97475         49     0.98460
 12     0.93594         31     0.97558         50     0.98491
 13     0.94098         32     0.97635
 14     0.94529         33     0.97707
 15     0.94901         34     0.97775
 16     0.95225         35     0.97839
 17     0.95511         36     0.97900
 18     0.95765         37     0.97957
 19     0.95991         38     0.98011
 20     0.96194         39     0.98062
Denote this latter estimator by G, as did Hedges. He showed
that G is not a consistent estimator of $\Delta$ as J → ∞. That
is, even though the number of experiments combined increases, the
estimator does not necessarily approximate the true value $\Delta$ more
closely. In fact, the estimates can differ from $\Delta$ by a considerable
amount depending on the sample sizes. To see this, consider the example of
a collection of experiments with 5 subjects per group. The estimator
$\hat{\Delta}$ has a bias which results in overestimation of $\Delta$ by
approximately 25 percent when four degrees of freedom are used for the
standard deviation estimate. Each estimator $\hat{\Delta}_j$ has the same
bias, therefore G is biased by the same amount as each $\hat{\Delta}_j$.
As J increases, the bias is unchanged, but the variance
of G tends to zero. Thus as the number of studies increases, the
estimator G estimates the wrong quantity more precisely (Hedges, 1979,
pp. 8-9; notation altered slightly).

The inconsistency in G as an estimator of $\Delta$ can be corrected
by using Hedges' earlier result, viz., correct each estimate by
$K(n_2 - 1)$ before averaging them.

Although $\Delta$ is simple, it can present many difficulties in
both conception and execution. Many research reports do not contain the
means and standard deviations of experimental conditions. Where there
are more than two experimental conditions and means are not reported,
there is little hope of ever recovering a $\Delta$ from the report. There
are several circumstances of incomplete data reporting in which a harmless
assumption and some simple algebra will make it possible to reconstruct
$\Delta$ measures:
1. One knows the value of t and whether $\bar{X}_E$ or $\bar{X}_C$ is larger.

2. One knows the significance level of a mean difference and the
two sample sizes.

3. One knows $\bar{X}_1, \ldots, \bar{X}_J$ and the value of F.

4. One knows $\bar{X}_E$ and $\bar{X}_C$ and the value of some multiple
comparisons statistics such as Tukey's q or Dunn's or Dunnett's statistics.

One example worked out in detail should suffice to illustrate how to
proceed in these general circumstances. The report of an experiment
contains J means $\bar{X}_1, \ldots, \bar{X}_J$, sizes of each group
$(n_1, \ldots, n_J)$, and an F statistic. Suppose that $\bar{X}_E$ is the mean
of the experimental condition of interest and that a second condition is a
control yielding $\bar{X}_C$.

The value of the F statistic was calculated by the original
investigator from the following formula:

$$F = \frac{\sum_{j=1}^{J} n_j(\bar{X}_j - \bar{X})^2 / (J - 1)}
{\sum_{j=1}^{J} (n_j - 1)S_j^2 / (N - J)}$$

where the only symbol which might not be obvious is N, which equals
$n_1 + n_2 + \cdots + n_J$. Under the assumption that the variance $S_j^2$ in
each group is the same, the above expression can be readily solved to
obtain $S_y^2$, the assumed homogeneous variance:

$$S_y^2 = \frac{\sum_{j=1}^{J} n_j(\bar{X}_j - \bar{X})^2}{(J - 1)F}$$

The effect size follows directly:

$$\hat{\Delta} = \frac{\bar{X}_E - \bar{X}_C}{S_y}$$

How to calculate $\Delta$ when $S_y^2$ is not homogeneous and how to define
$S_y$ in multifactor experimental designs are more than simple technical
questions. As will be seen later in this chapter, they raise basic concerns
about the definition and meaning of $\Delta$.
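The recovery of an effect size from group means, group sizes, and an F statistic can be sketched as follows. The sketch assumes a one-way design with a homogeneous within-group variance, so that F is the between-groups mean square divided by that variance; the function name and the data are hypothetical.

```python
import math


def effect_size_from_f(means, sizes, f_stat, index_e, index_c):
    """Recover the within-group sd from F, then standardize the difference."""
    n_total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    ms_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (len(means) - 1)
    s_y = math.sqrt(ms_between / f_stat)  # assumed homogeneous standard deviation
    return (means[index_e] - means[index_c]) / s_y


# Hypothetical report: three groups of 10, with F = 2.5.
# Group 0 is the experimental condition and group 1 the control.
delta = effect_size_from_f([12.0, 10.0, 11.0], [10, 10, 10], 2.5, 0, 1)
print(round(delta, 2))
```

In this hypothetical case the between-groups mean square is 10, so the recovered variance is 4 and the two-point mean difference amounts to one standard deviation.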

One commonly encountered method of reporting results presents
unique difficulties. Reports sometimes give only the sample sizes and
an indication of whether a mean difference was statistically significant
at a customary level. A conservative approximation to the $\Delta$ can be
derived by setting a t-ratio equal to the critical value corresponding
to the reported significance level and solving for
$(\bar{X}_E - \bar{X}_C)/S_x$, under the assumption of equal within-group
variances. For example, suppose that a report contains only the information
that the mean of the $n_E$ experimental subjects exceeded the mean of the
$n_C$ control subjects at the .05 level of significance. At the very least,
then,

$$t = \frac{\bar{X}_E - \bar{X}_C}{S_x\left(\frac{1}{n_E} + \frac{1}{n_C}\right)^{1/2}} = 1.96 .$$

Clearly,

$$\frac{\bar{X}_E - \bar{X}_C}{S_x} = 1.96\left(\frac{1}{n_E} + \frac{1}{n_C}\right)^{1/2}$$

gives a conservative estimate of the experimental effect. This small bit
of algebra also indicates how one obtains $\Delta$ when given only t and $n_E$
and $n_C$:

$$\Delta = t\left(\frac{1}{n_E} + \frac{1}{n_C}\right)^{1/2} \qquad (6)$$



When the n's in the two groups are equal, the effect size is
simply the value of the t-statistic multiplied by the square root of the
ratio of 2 to n, the common sample size. This calculation permits a
two-way tabulation in which $\Delta$ can be found given t and n. Such a
table is reproduced as Table 5.4. As an illustration of how it is
read, consider a study in which the means of two groups of 12 persons
each were compared with a t-test and a t-statistic of +2.10 was obtained.
From the table, the value of $\Delta$ is +.86.
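The conversion from t to an effect size, and the conservative estimate from a bare significance level, can be sketched as below. The function names are ours, and the group sizes in the conservative example are hypothetical.

```python
import math


def delta_from_t(t, n_e, n_c):
    """Effect size from a t-statistic and the two group sizes."""
    return t * math.sqrt(1.0 / n_e + 1.0 / n_c)


def table_row(t, sample_sizes):
    """One row of a t-to-effect-size table for equal group sizes."""
    return [round(delta_from_t(t, n, n), 2) for n in sample_sizes]


# The illustration from the text: t = 2.10 with 12 persons per group.
example = delta_from_t(2.10, 12, 12)

# Conservative estimate when only "significant at .05" is reported,
# here for hypothetical groups of 20 persons each.
conservative = delta_from_t(1.96, 20, 20)

row = table_row(2.10, [10, 12, 14, 16])
print(round(example, 2), round(conservative, 2), row)
```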

The Homogeneity of Variances Assumption in Transforming t and F
Statistics

In many studies where the emphasis in reporting is on inferential
statistics, only pooled information is available about the within-group
variances. Since the statistical tests used in these cases depend on an
assumption of homogeneity of within-group variances, the test statistics
frequently obscure whatever differences in variance might have existed.

When the results of an experiment are expressed as a t-statistic
which is reported along with $n_1$ and $n_2$ but without means and variances,
one can calculate an effect-size, $\Delta_p$, via the formula

$$\Delta_p = t\,(1/n_1 + 1/n_2)^{1/2} . \qquad (7)$$

The subscript p indicates that $\Delta$ is based
on a "pooling" of variances. Suppose, to the contrary, that the sample
variances are unequal, and that one wishes $\Delta_1$, the mean difference
standardized by the control group (group 1, for example) standard deviation.

Table 5.4. Table for converting t-statistic to effect size, $\Delta$,
given equal sample sizes, n.

[Table body not legible in the source. Rows give t from 0.00 to 4.00 in
steps of 0.10; columns give the common sample size n; each entry equals
$t(2/n)^{1/2}$.]
Assuming $n_1 = n_2$, the ratio of $\Delta_p$ to $\Delta_1$ can be derived:

(8)

As can be seen in Formula (8) and in Figure 5.3, $\Delta_1$ is exactly equal
to the surrogate, but accessible, value $\Delta_p$ when variances are equal.
The bias in the approximation is negative and no greater than about 25 percent
when control group variance is less than experimental group variance; however,
the bias can grow beyond any bounds when the inequality in the variances
is reversed. This indicates to us that the approximation of $\Delta_1$ via
a t-statistic (or presumably an F-ratio, as well) could be unsafe if the
sample variance of the experimental group substantially exceeds that for
the control group.

A psychological experiment performed by Hekmat (1973) illustrates the problems of this section and concerns of earlier sections about choice of the control group standard deviation and non-normality. Hekmat compared three methods of treating a phobia against an untreated control group. Ten persons constituted each of the four groups. A Behavior Avoidance Test (BAT) and a Fear Survey Schedule were administered to each of the forty persons before and after the treatment. The means and standard deviations for the four groups on the two measures appear in Table 5.5.

Since persons were assigned randomly to groups, the pretest statistics may be disregarded. Notice the wide discrepancies among posttest standard deviations: on the BAT, the standard deviation for the systematic desensitization group is more than five times as great as that for the control group. If the effect size, $\Delta$, comparing the systematic desensitization group against the control group is calculated by dividing by the experimental group standard deviation, its value is



$$\Delta = \frac{5.0 - 17.8}{3.39} = -3.78.$$

If, on the other hand, the control group standard deviation is used, the value of the effect size is

$$\Delta = \frac{5.0 - 17.8}{.63} = -20.32.$$

An effect size of twenty standard deviations is an absurd figure.

Suppose that Hekmat had only reported t-statistics instead of means and standard deviations. The t-statistic for the comparison of the systematic desensitization and control groups would equal

$$t = \frac{5.0 - 17.8}{\sqrt{11.889 / 10}} = -11.74.$$

Converting this t-statistic to an effect size, assuming homogeneous variances as is necessary, gives a $\Delta$ of -5.25.

Effect sizes that bounce around from 20 to 3 to 5 to whatever else depending on one or another assumption indicate that something is fundamentally wrong. In the case of Hekmat's data the problem lies with the measurement scales. They undoubtedly would show, upon inspection of distributions of the data, severe ceiling and floor effects with resulting asymmetry and non-normality.
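
The three standardizations discussed for Hekmat's data can be checked numerically. What follows is a minimal Python sketch rather than anything taken from the report; the function name `hekmat_effect_sizes` is hypothetical, and the conversion from t assumes equal group sizes and homogeneous variances, as the text does.

```python
import math

def hekmat_effect_sizes(mean_e, mean_c, sd_e, sd_c, n):
    """Return the effect size under three standardizations (equal n per group)."""
    diff = mean_e - mean_c
    by_experimental_sd = diff / sd_e
    by_control_sd = diff / sd_c
    # Independent-groups t with equal n: squared standard error is (s_E^2 + s_C^2) / n.
    t = diff / math.sqrt((sd_e ** 2 + sd_c ** 2) / n)
    # Conversion that assumes homogeneous variances: delta = t * sqrt(2 / n).
    via_t = t * math.sqrt(2.0 / n)
    return by_experimental_sd, by_control_sd, t, via_t

d_e, d_c, t_stat, d_t = hekmat_effect_sizes(5.0, 17.8, 3.39, 0.63, 10)
print(round(d_e, 2), round(d_c, 2), round(t_stat, 2), round(d_t, 2))
```

Run on the posttest BAT figures quoted above, the sketch reproduces the values in the text, which makes the dependence of the result on the chosen standard deviation easy to see.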

Studies Without Control Groups

Suppose that in a meta-analysis of experimental evaluations of science curricula the typical studies involve the comparison of a new curriculum (e.g., Science Curriculum Improvement Study (SCIS) or Science: A Process Approach (SAPA)) against traditional science curricula (group lecture, teacher-centered and oriented toward knowledge acquisition rather than developing inquiry skills). From such studies, effect sizes comparing SCIS or SAPA against Traditional could be calculated in the usual way, e.g.,

$$\Delta = \frac{\bar{Y}_{SAPA} - \bar{Y}_{T}}{s_T},$$

where the Traditional curriculum is thought of as a "control" condition.

Experiments will exist in which SCIS is compared to SAPA and no Traditional comparison is involved. It makes no sense to pool in the same analyses some effect sizes based on SCIS vs. Traditional comparisons, some based on SAPA vs. Traditional comparisons, and a third group based on SCIS vs. SAPA comparisons. For if SCIS and SAPA are both superior curricula, their large and positive effects should not be lumped with comparisons between themselves, which would be small. The problem can be resolved by means of control referencing of the effect sizes. Each effect size based on a direct comparison of SCIS and SAPA can be broken into two effect sizes that reference each curriculum against a hypothetical control group (in this case, the Traditional curriculum).

Assume that there exists some number of effect sizes calculated from comparisons of SCIS and Traditional curricula; denote the average of these effects by $\bar{\Delta}_{SC}$. Likewise, denote the average of all effect sizes gotten by comparing SAPA and Traditional by $\bar{\Delta}_{SA}$. A single study in which SCIS and SAPA are compared without a Traditional group yields one effect size, $\Delta_{SC-SA}$. We wish to break $\Delta_{SC-SA}$ into two effects, $\Delta'_{SC}$ and $\Delta'_{SA}$, which estimate the effect sizes that would have been obtained in this study if a Traditional group had been included.

Two reasonable conditions may be imposed on $\Delta'_{SC}$ and $\Delta'_{SA}$, the control-referenced effect sizes:

1) $\Delta_{SC-SA} = \Delta'_{SC} - \Delta'_{SA}$, and
2) $\Delta'_{SC} - \bar{\Delta}_{SC} = \bar{\Delta}_{SA} - \Delta'_{SA}$. (10)

These conditions imply 1) that the observed difference from the direct comparison is preserved in the control-referenced comparison, and 2) that the error (the deviation of a control-referenced effect from the average of all similar non-control-referenced effects) is equally shared between the two referenced effects. These two conditions establish a pair of independent linear equations in two unknowns that can be solved for the two control-referenced effects:

$$\Delta'_{SC} = \tfrac{1}{2}\left(\Delta_{SC-SA} + \bar{\Delta}_{SC} + \bar{\Delta}_{SA}\right), \text{ and}$$
$$\Delta'_{SA} = \Delta'_{SC} - \Delta_{SC-SA}. \qquad (11)$$

Consider this illustration. In 100 comparisons of SCIS and Traditional curricula, the average effect size for the dependent variable "interest in science" is 0.76. For 200 comparisons of SAPA and Traditional, the average is 0.48. An experiment in which SCIS and SAPA were compared showed an effect size on "interest in science" of $\Delta_{SC-SA} = .30$. The two control-referenced effects, then, are given by

$$\Delta'_{SC} = (.30 + .76 + .48)/2 = .77, \text{ and}$$
$$\Delta'_{SA} = .77 - .30 = .47.$$
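
Control referencing as described in Formula (11) is a two-line computation. The sketch below is illustrative only; the function name `control_reference` and its argument names are hypothetical and not drawn from the report.

```python
def control_reference(direct_effect, mean_effect_first, mean_effect_second):
    """Split a direct treatment-vs-treatment effect into two control-referenced effects.

    direct_effect      : effect size from the direct comparison (first minus second)
    mean_effect_first  : average effect of the first treatment against control
    mean_effect_second : average effect of the second treatment against control
    """
    first = (direct_effect + mean_effect_first + mean_effect_second) / 2.0
    second = first - direct_effect
    return first, second

scis, sapa = control_reference(0.30, 0.76, 0.48)
print(round(scis, 2), round(sapa, 2))
```

With the SCIS and SAPA figures from the illustration, the two returned values differ by the observed .30 and lie equally far (in opposite directions) from their respective averages, which is what the two conditions require.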

Finding a Standardizing Variance for
Studies Without Control Groups

Among the research reports relevant for a particular meta-analysis may be some which provide experimental comparisons of two treatment conditions of interest (say A and B) but include no control condition C. Such studies will provide, at best, standard deviations for the two treatment conditions, but neither of these is appropriate for reasons discussed in the previous section. An estimate can be obtained, however. If all studies in which A is compared with C are taken, the observed control group standard deviations can be regressed on the observed treatment A group standard deviations to give:

$$\hat{s}_C = b_0 + b_1 s_A. \qquad (12)$$

A similar regression can be established for $s_C$ and $s_B$ from those studies comparing treatment B with control C. Non-linear regressions are possible, of course. From a study comparing only treatments A and B, the observed standard deviations $s_A$ and $s_B$ can be substituted into their separate regression equations to provide two estimates of $s_C$. These two estimates could be pooled to provide the standard deviation with which to scale the mean difference $(\bar{Y}_A - \bar{Y}_B)$. From information from other studies about effect sizes for A and B against control, this effect between two treatment conditions could then be converted to separate effects between the treatment and control (see previous section).

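
The regression approach to estimating a missing control group standard deviation can be sketched as follows. This is a minimal illustration under stated assumptions: the standard deviations are invented, the helper names `fit_line` and `estimate_control_sd` are hypothetical, and "pooling" the two estimates is taken here to mean a simple average, which the report does not specify.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def estimate_control_sd(s_a, s_b, fit_a, fit_b):
    """Average the two regression-based estimates of the control SD (assumed pooling rule)."""
    from_a = fit_a[0] + fit_a[1] * s_a
    from_b = fit_b[0] + fit_b[1] * s_b
    return (from_a + from_b) / 2.0

# Hypothetical SDs from studies that did include a control group.
fit_a = fit_line([8.0, 10.0, 12.0], [9.0, 11.0, 13.0])    # A-vs-C studies: s_A, s_C
fit_b = fit_line([10.0, 20.0, 30.0], [6.0, 11.0, 16.0])   # B-vs-C studies: s_B, s_C

# A new study comparing only A and B reports s_A = 11 and s_B = 18.
s_c_hat = estimate_control_sd(11.0, 18.0, fit_a, fit_b)
print(s_c_hat)
```

The estimate would then serve as the divisor for the mean difference between A and B.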
Experiments with quantitative independent variables (time, size, etc.) often have no untreated "control" condition. (A general approach to integrating effects from experiments with quantitative independent variables is described in Chapter Six.) For studies of drug dosage, amount of instruction and so on, a control condition of no treatment can be defined and included. For studies of an independent variable such as class size, one investigator's control can be another's treatment. But each study involves some number of comparisons of a small condition (S) and a large condition (L) and yields two means, $\bar{Y}_S$ and $\bar{Y}_L$, and two standard deviations, $s_S$ and $s_L$.

If the standard deviations vary with the value of the independent variable, then some value of that variable can be chosen as a reference point and its standard deviation used for converting all treatment mean differences to effect sizes. The problem is to find a way of converting from the observed $s_S$ and $s_L$ on the variable used in a given study to an estimate of $s_R$, the standard deviation for the reference group on that variable.

From all studies, the ratio of the observed standard deviations can be regressed on the values of the quantitative independent variable used in the comparison, viz., small (S) and large (L). The resulting regression function is

$$s_S / s_L = b_0 + b_1 S + b_2 L. \qquad (13)$$

If a standard deviation $s_S$ is observed in a particular study for condition S, the standard deviation for the reference condition R could be estimated, if R > S, as

$$\hat{s}_R = s_S / (b_0 + b_1 S + b_2 R). \qquad (14)$$

A second estimate of $s_R$ can be obtained from the observed $s_L$ in the same study. The mean of the two estimates could be used. (If R < S or R > L, the regression equation can still be used but with substitutions appropriately reversed.) The observed mean differences $(\bar{Y}_S - \bar{Y}_L)$ can then be scaled to effect sizes for the corresponding differences in the value of the independent variable:

$$\Delta_{S-L} = \frac{\bar{Y}_S - \bar{Y}_L}{\hat{s}_R}. \qquad (15)$$
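
The reference-condition estimate can be illustrated with a short Python sketch. Everything numeric here is hypothetical (the coefficients, class sizes, means and standard deviations), and the second estimate assumes S < R < L so that the same regression function applies with R playing the role of the smaller condition; the report leaves that substitution to the reader.

```python
def ratio(b, small, large):
    """Regression function (13): predicted s_small / s_large."""
    b0, b1, b2 = b
    return b0 + b1 * small + b2 * large

def reference_sd(s_s, s_l, S, L, R, b):
    """Average two estimates of the reference SD, assuming S < R < L."""
    from_small = s_s / ratio(b, S, R)   # Formula (14): s_S / s_R = f(S, R)
    from_large = s_l * ratio(b, R, L)   # assumed reversal: s_R / s_L = f(R, L)
    return from_small, from_large, (from_small + from_large) / 2.0

b = (1.0, 0.01, -0.01)                  # hypothetical regression coefficients
est_small, est_large, s_r = reference_sd(8.5, 12.0, S=10, L=40, R=25, b=b)
effect = (55.0 - 50.0) / s_r            # hypothetical means for the S and L conditions
print(round(est_small, 2), round(est_large, 2), round(effect, 3))
```

Keeping the two separate estimates visible, as the function does, makes it possible to see how far they disagree before they are averaged.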



METRIC FOR MEAN DIFFERENCES 

Final Status Score 

In a study with random assignment of subjects to treatment and control conditions, means can be obtained on a criterion measure Y as $\bar{Y}_T$ and $\bar{Y}_C$. The mean difference can be scaled to an effect size by the control group standard deviation on this measure, $s_y$. Final status, as the scale of the criterion measure, has several advantages over derived gain measures such as raw and residual gain scores and covariance adjusted final status scores. First, it is phenomenologically more relevant and, therefore, provides results more readily interpretable, particularly by lay audiences to whom a meta-analysis might be addressed. Second, the variance of the derived gain measures contains confounded "measurement error" which can significantly bias results.

Where there are pre-experiment group differences, the use of a post-treatment status scale will also be biased. It is with such biases that the derived gain measures were designed to deal. That they do not deal with them adequately is one problem. That they express the group comparisons on a scale different from that used in randomized studies with only a final status measure is a further problem for meta-analysis. If the final status scale is to be preferred, then procedures must be found for converting results of studies using other scales to this one while minimizing the biases due to pre-experiment differences. This paper suggests such procedures.

Conversions From Other Scales

Raw Gain Scores. If the gain score from a pre-experiment measure (x) to a post-experiment criterion measure (y), for person i in the control group, is:

$$G_{iC} = Y_{iC} - X_{iC}, \qquad (16)$$

it is obvious that the mean gain is simply the difference between the post-experiment mean ($\bar{Y}_C$) and the pre-experiment mean ($\bar{X}_C$). The difference between treatment and control group mean gains will be:

$$\bar{G}_T - \bar{G}_C = (\bar{Y}_T - \bar{Y}_C) - (\bar{X}_T - \bar{X}_C). \qquad (17)$$

For the computation of an effect size on the final status scale the mean difference required is $(\bar{Y}_T - \bar{Y}_C)$; it is better, however, to use $(\bar{G}_T - \bar{G}_C)$. If there are no pre-treatment differences between the groups, i.e., $(\bar{X}_T - \bar{X}_C) = 0$, the two will be identical anyway. If there are pre-treatment differences, as there often are in studies in which gains are resorted to, then $(\bar{G}_T - \bar{G}_C)$ has the advantage that it is not contaminated so directly by the pre-treatment differences.

Residual Scores. The residual element of the final status score, for person i in the control group, unexplainable from that person's status on a second variable X is:

(18)

The mean difference between treatment and control groups in residual scores will be:

(19)

Again, although the mean difference of interest for the computation of an effect size on the final status scale is $(\bar{Y}_T - \bar{Y}_C)$, it is better to use $(\bar{g}_T - \bar{g}_C)$. If there are no pre-experiment differences, the two will be the same. If there are, as there often are in studies in which residual scores are resorted to, then $(\bar{g}_T - \bar{g}_C)$ has the advantage that it is not contaminated so directly by the pre-treatment differences.

Covariance Adjusted Scores. Since the covariance adjustment of final scores is conceptually similar to the computation of residual final status scores, the same points may be made. The adjusted group means for ANCOVA will be the $\bar{g}$ in the previous section provided that the residuals there are computed using a regression line through the grand centroid $(\bar{X}_{..}, \bar{Y}_{..})$ with a pooled within-group estimate of slope.

Use of the regression line fitted to the total bivariate distribution, ignoring group membership, is inappropriate. If there is a treatment effect which shifts the relative levels of group performance on Y, unpredictable from their relative positions on X, this treatment effect will be in part removed in the computation of the residuals. Use of a regression line through the grand centroid with a pooled within-group estimate of slope removes only those final status differences attributable to prior status differences and none due to treatment effects. If the total group regression line has been used in a study, it will be difficult to include its results in a meta-analysis unless the prior and final status means are provided.
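
The contrast between the pooled within-group slope and the total-group slope can be made concrete with a small, artificial data set. The sketch below is illustrative only: the scores are invented so that the true treatment shift is 5 points, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def pooled_within_slope(groups):
    """Slope from within-group sums of products; groups is a list of (x, y) lists."""
    sxy = sxx = 0.0
    for xs, ys in groups:
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def total_slope(groups):
    """Slope fitted to all cases at once, ignoring group membership."""
    xs = [x for g in groups for x in g[0]]
    ys = [y for g in groups for y in g[1]]
    return pooled_within_slope([(xs, ys)])

def adjusted_difference(treatment, control, slope):
    """Final status mean difference minus slope times the prior status mean difference."""
    return ((mean(treatment[1]) - mean(control[1]))
            - slope * (mean(treatment[0]) - mean(control[0])))

control = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # y = 2x
treatment = ([2.0, 3.0, 4.0], [9.0, 11.0, 13.0])    # y = 2x + 5

with_pooled = adjusted_difference(treatment, control, pooled_within_slope([treatment, control]))
with_total = adjusted_difference(treatment, control, total_slope([treatment, control]))
print(with_pooled, round(with_total, 2))
```

In this constructed case the pooled within-group slope recovers the 5-point shift, while the total-group slope absorbs part of it, which is the distortion the paragraph above warns about.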

The difference between the covariance adjusted group means, then, will be:

$$\bar{g}_T - \bar{g}_C = (\bar{Y}_T - \bar{Y}_C) - b_{y \cdot x}(\bar{X}_T - \bar{X}_C). \qquad (20)$$

Achieving "Comparability" When There Are Pre-Treatment Group Differences. The uses of gain scores, residual scores, and covariance adjustments when there are pre-experiment group differences are attempts to render the groups comparable. In a meta-analysis there is a different problem of comparability. If there are no pre-treatment differences, then mean differences computed between groups will be the same whatever the scale. That is:

$$(\bar{Y}_T - \bar{Y}_C) = (\bar{G}_T - \bar{G}_C) = (\bar{g}_T - \bar{g}_C) = (\bar{Y}'_T - \bar{Y}'_C). \qquad (21)$$

The choice of scales will influence the estimate $s_y$, of course, even where it does not affect the mean difference. Where there are pre-treatment mean differences, then it is inappropriate to use $(\bar{Y}_T - \bar{Y}_C)$; but the question is which of the others to use. Some studies to be included in the meta-analysis may report results with gain scores; others may report residual or covariance adjusted scores. There seems to be no a priori reason for preferring one to the other. It is a choice that the reviewer undertaking a particular meta-analysis must make and should report. Consistency is important. Results on one scale can be converted to the other using

$$(\bar{g}_T - \bar{g}_C) = (\bar{G}_T - \bar{G}_C) + (1 - b_{y \cdot x})(\bar{X}_T - \bar{X}_C). \qquad (22)$$

The mean differences used for the computation of effect sizes will then all be either $(\bar{Y}_T - \bar{Y}_C)$ or the same variety of adjusted group differences used as an approximation of the final status differences for "initially comparable" groups. The form of the mean difference should be recorded so that any systematic differences in effect size related to the form of its calculation can be revealed.

Choice of Standard Deviation for
Scaling Mean Differences

The choice of the standard deviation with which to scale the differences between group means is crucial. Variations in choice can be reflected in substantial differences in effect size. Recording the choice made in each case can allow the investigation of any systematic interaction between the choice and the effect size computed but, unless the relationship is simple, other important relationships with effect size may be obscured.

For most problems, it seems preferable to standardize group mean differences by the standard deviation of the final status variable, not by the standard deviation of some type of gain, change or residual score. The choice of a standardizing metric is hardly trivial. Consider an experimental study in which pretests and posttests were administered and in which no pretest mean differences existed. Suppose further that the pretest-posttest correlation is .75, the posttest mean difference is 10 points and the posttest standard deviation is 15. The effect size, $\Delta_y$, in terms of the final status measure is:

$$\Delta_y = \frac{10}{15} = .67.$$

As will be seen below, the standard deviation of residual scores in this instance is $15\sqrt{1 - .75^2} = 9.92$. Hence, the effect size in terms of the metric of residual scores is:

$$\Delta_g = \frac{10}{9.92} = 1.01.$$

Obviously the choice of metric makes quite a difference in the calculated effect. Neither calculation is wrong; they merely reflect alternative expressions of the general phenomenon of the experimental results. No rigid rules about which metric is best would be advisable, but the metric of the final status measure seems preferable. Final status (i.e., "posttest score") is a phenomenon more readily perceived and experienced than change or gain; hence, the expression of results on the scale of final status is phenomenologically more meaningful. In addition, there are several ways to measure change or gain that are equally good, or bad (Cronbach and Furby, 1970): "simple gain," "residual gain," "estimated true gain," and others; each has a different variance and would give a different value of ES. It seems better to avoid them all and standardize group mean differences in terms of final status.
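
The effect of the standardizing metric in the pretest-posttest illustration can be verified with a few lines of Python. This is only a sketch of the arithmetic already given in the text; the function name `residual_sd` is hypothetical.

```python
import math

def residual_sd(final_sd, correlation):
    """SD of residual scores implied by the final status SD and the pre-post correlation."""
    return final_sd * math.sqrt(1.0 - correlation ** 2)

mean_difference = 10.0
final_sd = 15.0
r_pre_post = 0.75

delta_final = mean_difference / final_sd
sd_resid = residual_sd(final_sd, r_pre_post)
delta_resid = mean_difference / sd_resid
print(round(delta_final, 2), round(sd_resid, 2), round(delta_resid, 2))
```

The same 10-point difference thus appears roughly half again as large when expressed in residual-score units, which is why the metric used must be recorded.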
Control Group Standard Deviation on Final Status

Direct Use of Control Group Standard Deviation. Where the standard deviation for a control group on final status scores is available, it should be used. The relative effect of treatment with respect to no treatment can then be readily described in terms of the distribution of scores for untreated subjects. Of course, separate effect sizes could be estimated using both control group and experimental group standard deviations. These effect sizes need different interpretations since they express the mean differences in terms of different distributions. The most straightforward procedure is to use the control group distribution as the point of reference. For cases in which the treatment and control group standard deviations are not homogeneous, the treatment group standard deviation will vary with the nature of the treatment. Attempting to keep track of such variations through analysis and interpretation will unnecessarily complicate the analysis.

Retrieval From Standard Deviations on an Adjusted Metric

In the preceding discussion of choice of standard deviation, all standard deviations were taken to be expressed on the metric of the final status scores. If those scores have been adjusted in some way, the standard deviation on the final status metric needs to be retrieved from that on the adjusted metric. Procedures for making such adjustments are described in this section.

Raw Gain Scores. With the raw gain score defined by (16), the variance of the raw gain scores can be shown to be:

$$\sigma_G^2 = \sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\sigma_x\sigma_y, \qquad (23)$$

which, if it can be assumed that $\sigma_x = \sigma_y$, reduces to:

$$\sigma_G^2 = 2\sigma_y^2(1 - \rho_{xy}). \qquad (24)$$

If the control group standard deviation is provided in terms of raw gain scores as $s_G$, its standard deviation on the final scores can be obtained from

$$s_y = \frac{s_G}{\sqrt{2(1 - r_{xy})}}.$$

In many studies reporting in terms of raw gain scores, no information is provided about the correlation between the two status measures. It is also important to note that the correlation required is $r_{xy}$ for the control group or, at least, a pooled within groups estimate of it. If the correlation is not provided, a reasonable guess can probably be made if something is known about the tests involved. For standardized tests, a published test-retest reliability might be appropriate.
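
Retrieving the final status standard deviation from a reported gain-score standard deviation can be sketched as below. The function names are hypothetical, the equal-variance assumption of Formula (24) is built in, and the correlation supplied would in practice often be a guess such as a published test-retest reliability.

```python
import math

def gain_sd(sd_x, sd_y, r_xy):
    """SD of raw gain scores from Formula (23)."""
    return math.sqrt(sd_x ** 2 + sd_y ** 2 - 2.0 * r_xy * sd_x * sd_y)

def final_sd_from_gain(sd_gain, r_xy):
    """Invert Formula (24); valid only if pretest and posttest SDs are equal."""
    return sd_gain / math.sqrt(2.0 * (1.0 - r_xy))

reported_gain_sd = gain_sd(15.0, 15.0, 0.75)       # what a gain-score study would report
recovered = final_sd_from_gain(reported_gain_sd, 0.75)
print(round(reported_gain_sd, 2), round(recovered, 2))
```

Because the divisor shrinks as the correlation approaches 1, a poor guess at a high correlation can distort the recovered standard deviation considerably.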

Residual Status Scores. With residual status scores defined by (18), the variance of the residual scores can be shown to be:

$$\sigma_g^2 = \sigma_y^2(1 - \rho_{xy}^2), \qquad (25)$$

without any necessary assumption about equality of $\sigma_x$ and $\sigma_y$.

If the control group standard deviation is provided in terms of residual scores as $s_g$, its standard deviation on the final status scores can be obtained from

$$s_y = \frac{s_g}{\sqrt{1 - r_{xy}^2}}. \qquad (27)$$

Information about the correlation between scores on the two status measures is more likely to be provided in studies using residual scores than in studies using raw gain scores. The correlation required is the pooled within group correlation, not the control group correlation. Since the residuals are calculated using a pooled estimate of slope, and not separate group estimates, it is with the pooled estimate of correlation that the unreduced standard deviation can be recovered. If the control group standard deviation on residual scores, $s_g$, is available, it should be used rather than the pooled estimate.
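
Formula (27) can be applied directly. The following minimal sketch uses the figures from the earlier pretest-posttest illustration (residual standard deviation 9.92, correlation .75); the function name is hypothetical.

```python
import math

def final_sd_from_residual(sd_residual, r_xy):
    """Formula (27): recover the final status SD from the SD of residual scores."""
    return sd_residual / math.sqrt(1.0 - r_xy ** 2)

recovered_sd = final_sd_from_residual(9.92, 0.75)
print(round(recovered_sd, 1))
```

The result returns, to rounding, the posttest standard deviation of 15 from which the residual value was originally derived.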

Covariance Adjusted Final Status Scores. One effect of covariance adjustments is to reduce the within-group standard deviation in a manner similar to that described for residual scores. If the standard deviation for the control group on the residual scores is given, the standard deviation for the final status scores can be estimated using formula (27).

If only the covariance adjusted pooled within-group mean square, $MS'_w$, is known, a pooled estimate of the within-group standard deviation on final status scores can be obtained from:

$$s_y = \sqrt{\frac{MS'_w\,(df_w - 1)}{(1 - r_{xy}^2)(df_w - 2)}}. \qquad (28)$$

Retrieval from Higher Order Factorial Designs

Many experimental comparisons of a treatment and a control condition use more complex designs than the simple comparison of two groups. Some introduce other factors into a higher-order analysis of variance design to examine interactions. In the process these designs create a new definition of within-cell variance. Others introduce stratification of subjects (matching of pairs being an extreme example) to reduce the error variance and obtain a more powerful significance test. The use of repeated measures designs in which subjects are matched to themselves is intended to achieve even more power by the same means.

In reports of studies of this type, only the pooled information in analysis of variance tables is provided. Means must be found to retrieve an appropriate estimate of the control group standard deviation.

Additional Factors of Theoretical Interest. If a higher order analysis of variance is used to explore interactions between the treatment and other factors, that information should not be lost but should instead be coded into the meta-analysis. It is just such interactions that meta-analysis may reveal between studies. Any results which reveal such interactions within studies should be preserved in the data of the meta-analysis. For example, a study to compare treatment and control conditions (Factor A) may stratify the sample of subjects into males and females (Factor B) to study the interaction of the treatment with the subject's gender. For an effect size based on the difference between the overall treatment and control means $(\bar{Y}_{T\cdot} - \bar{Y}_{C\cdot})$, the appropriate standard deviation would be that for the total control group. A pooled estimate of this would be given by:

$$s_y = \sqrt{\frac{SS_B + SS_{AB} + SS_W}{df_B + df_{AB} + df_W}}. \qquad (29)$$

An effect size for males alone would be based on the mean difference $(\bar{Y}_{TM} - \bar{Y}_{CM})$. The appropriate standard deviation would be the one for the control group males, for which a pooled estimate would be given by

$$s_y = \sqrt{MS_W}. \qquad (30)$$
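
Pooling sums of squares from a treatment-by-gender analysis of variance table can be sketched as follows. The table entries are invented for illustration and the function name `pooled_sd` is hypothetical; the treatment sum of squares is deliberately left out of the pool, as in Formula (29).

```python
import math

def pooled_sd(sums_of_squares, degrees_of_freedom):
    """Square root of pooled sums of squares over pooled degrees of freedom."""
    return math.sqrt(sum(sums_of_squares) / sum(degrees_of_freedom))

# Hypothetical ANOVA table for a 2 x 2 design with 10 subjects per cell.
ss_b, df_b = 300.0, 1       # gender (Factor B)
ss_ab, df_ab = 100.0, 1     # treatment x gender interaction
ss_w, df_w = 3600.0, 36     # within cells

sd_overall = pooled_sd([ss_b, ss_ab, ss_w], [df_b, df_ab, df_w])   # Formula (29)
sd_within_gender = math.sqrt(ss_w / df_w)                           # Formula (30)
print(round(sd_overall, 2), round(sd_within_gender, 2))
```

The overall estimate is the larger of the two because it restores the between-gender variation that stratification had removed from the error term.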

Stratification on a Continuous Variable Correlated with Outcome. In some studies subjects are stratified on a continuous variable which is correlated with the final status measure. This design allows the within cells sum of squares from the corresponding unstratified design to be partitioned as

$$SS_{W(\text{unstratified})} = SS_B + SS_{AB} + SS_W, \qquad (31)$$

as for the case where B is a factor of theoretical interest. Although this design also allows a more powerful test of the treatment effect, there is usually no substantive interest in the between levels variation or the treatment by levels interaction. The control group standard deviation should be obtained as the pooled estimate in formula (29).

If the stratification is achieved by matching pairs, there will be no $SS_W$ term. Only the terms $SS_B$ and $SS_{AB}$ will exist to be pooled. Where the matched pairs data are analyzed by a dependent groups t-test, the standard error of the mean difference between pairs is

$$s_{\bar{D}} = \sqrt{\frac{\sigma_T^2 + \sigma_C^2 - 2 r_{TC}\,\sigma_T\sigma_C}{n}}, \qquad (32)$$

where $\sigma_T$ and $\sigma_C$ are the standard deviations of the treatment and control groups, $r_{TC}$ is the correlation between pairs and n is the number of pairs. If the standard deviations for experimental and control conditions are assumed to be homogeneous, then (32) becomes:

$$s_{\bar{D}} = \sigma\sqrt{\frac{2(1 - r_{TC})}{n}}. \qquad (33)$$

If the standard error of the mean difference between pairs is reported, the control group standard deviation on the final status measure can be estimated as:

$$s_y = s_{\bar{D}}\sqrt{\frac{n}{2(1 - r_{TC})}}. \qquad (34)$$



Since the correlation between pairs, $r_{TC}$, will probably not be reported, it must be estimated. The matching will have been done on some variable X measured before the experiment. The partial correlation of scores on the outcome measure Y between members of pairs, controlling for the common score on X for members of each pair, will be:

$$r_{TC \cdot X} = \frac{r_{TC} - r_{XY_T}\,r_{XY_C}}{\sqrt{(1 - r_{XY_T}^2)(1 - r_{XY_C}^2)}}.$$

If the correlation between X and Y is the same in each group, that is, $r_{XY_T} = r_{XY_C} = r_{xy}$, then

$$r_{TC \cdot X} = \frac{r_{TC} - r_{xy}^2}{1 - r_{xy}^2} \qquad (35)$$

and, therefore,

$$r_{TC} = r_{xy}^2 + r_{TC \cdot X}(1 - r_{xy}^2).$$

If all that members of a pair have in common can be accounted for by their common scores on the matching variable, then the partial correlation between their scores on any other variable, partialling out their scores on the matching variable, should be zero. A reasonable estimate of the correlation between pairs on the final status measure then would be:

$$\hat{r}_{TC} = r_{xy}^2.$$

If $r_{xy}$ (within group) is not provided in the report, a reasonable guess can be made if something is known about the tests involved.
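
The matched-pairs retrieval in Formula (34), combined with the estimate of the between-pairs correlation as the square of $r_{xy}$, can be sketched as follows. The numbers are hypothetical and the function names are not from the report; homogeneous variances are assumed, as in Formula (33).

```python
import math

def pair_correlation_from_matching(r_xy):
    """Estimated correlation between pair members when the partial correlation is zero."""
    return r_xy ** 2

def sd_from_paired_se(se_mean_diff, n_pairs, r_pairs):
    """Formula (34): final status SD from the standard error of the mean pair difference."""
    return se_mean_diff * math.sqrt(n_pairs / (2.0 * (1.0 - r_pairs)))

r_pairs = pair_correlation_from_matching(0.6)             # hypothetical r_xy = .6
# Standard error a report would show if the true SD were 15 with 25 pairs (Formula 33).
reported_se = 15.0 * math.sqrt(2.0 * (1.0 - r_pairs) / 25)
recovered_pair_sd = sd_from_paired_se(reported_se, 25, r_pairs)
print(round(r_pairs, 2), round(reported_se, 2), round(recovered_pair_sd, 2))
```

The round trip shows only that (33) and (34) are inverses; the quality of a real estimate rests on how well the assumed correlation matches the study.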

Stratification on a Continuous Variable of Theoretical Interest. In some studies stratification on a continuous variable may be used to introduce a factor in which there is theoretical interest. For example, in research on ability grouping some studies test only overall mean performances of students taught in homogeneous groups and students taught in heterogeneous groups. Other studies examine, as well, the possibility of differential effectiveness, presenting and testing the significance of differences between homogeneously and heterogeneously grouped students at various levels of ability. Effect sizes can be estimated for both the overall mean differences and the mean differences at different ability levels. The question is, however, which standard deviation should be used to scale the mean differences at specific ability levels: the total control group standard deviation (or a pooled estimate of it), or the standard deviation for the sub-set of the control group at that level (or a pooled estimate of it).

The choice will depend on both the interpretation to be made of the effect sizes and the extent of aggregation of effect sizes. If mean effect sizes over all levels are to be computed, or if effect sizes for various ability levels are to be compared, they should be scaled in terms of the standard deviation of the whole control group. If, from the analysis, it emerges that there are different effect sizes for different ability levels, new effect size estimates based on the control group for each particular level can be calculated. These effect sizes will be indices of the efficacy of treatment at a particular ability level with reference to the distribution of the scores of the relevant untreated groups at that level.

If a study presents data on only a part of the total distribution, it will be necessary to estimate the standard deviation for the whole control population from the available standard deviation for a truncated section of it. Otherwise the effect sizes calculated will vary according to the homogeneity of the truncated portion used. For this estimation, information will be required about the correlation between the stratifying variable and the final status measure and the selectivity of the sub-group on the grouping variable.

An alternative to estimating the total control group standard deviation, however, would be to use the reported standard deviations and to rate the extent of the truncation of the distribution on a crude three to five point scale. These ratings could be correlated with the effect sizes to determine whether there is any relationship.

Repeated Measures Analyses. Where the treatment and control conditions are such that they can both be applied to the same sample, repeated measures designs are sometimes used to avoid inter-subject variability between groups. In the simplest case, where treatment is one factor (A) and subjects the other (S), the error term for testing the significance of the difference between the treatment group means is the A×S interaction mean square. An estimate of the appropriate control group standard deviation can be obtained if the sums of squares for S and A×S are pooled. Similar approaches to pooling can be used for mixed model designs in which subjects are nested under some additional factors but crossed with treatments.
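
The pooling of the S and A×S sums of squares can be illustrated from raw scores. The data below are invented and the function name is hypothetical; the sketch also relies on the standard identity that, in a subjects-by-treatments layout, the pooled S and A×S sums of squares equal the sum of squares within conditions.

```python
import math

def pooled_sd_repeated(scores):
    """Pool SS for subjects (S) and A x S; scores holds one row per subject, one column per condition."""
    n, a = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * a)
    subject_means = [sum(row) / a for row in scores]
    condition_means = [sum(row[j] for row in scores) / n for j in range(a)]
    ss_subjects = a * sum((m - grand) ** 2 for m in subject_means)
    ss_interaction = sum(
        (scores[i][j] - subject_means[i] - condition_means[j] + grand) ** 2
        for i in range(n) for j in range(a)
    )
    pooled_df = (n - 1) + (n - 1) * (a - 1)
    return math.sqrt((ss_subjects + ss_interaction) / pooled_df)

# Four hypothetical subjects measured under control (first column) and treatment (second).
scores = [[10.0, 12.0], [14.0, 15.0], [9.0, 13.0], [11.0, 16.0]]
sd_repeated = pooled_sd_repeated(scores)
print(round(sd_repeated, 2))
```

For these scores the pooled value equals the square root of the average within-condition variance, which is the between-subject variability the repeated measures error term had removed.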
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TRANSLATION OF SIGNIFICANCE LEVELS INTO EFFECT-SIZE 

Imagine that in the report of a study it is recorded only that a
particular test statistic (e.g., t or F or Fisher's Z transformation of r)
was calculated on n cases and that its level of significance (i.e., tail
area under the null hypothesis) was p. How can one transform this meager
information into a measure of effect size or correlation? Provided that the
p-value was reported exactly and not rounded to coarse approximations such
as .05 > p > .01 (in which case some very crude conventions must be
adopted), the transformation is straightforward. If, for example, it is
reported that a two group t-test with $n_1 = n_2 = 6$ was significant at the
p = .02 level (two-tailed test), then it is a simple matter of looking
up the value of t in a t-table:

$${}_{.99}t_{10} = 2.76$$

Thus, one knows $n_1$, $n_2$ and the value of the t-test; hence, one can
proceed to $\Delta$ via the conventional steps derived and illustrated elsewhere:

$$\Delta = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 2.76\sqrt{\frac{1}{6} + \frac{1}{6}} = 1.59$$

The reasoning and methods are similar for all of the other test
statistics for which we have derived transformations to $\Delta$ or r (see Glass,
1977; Smith, Glass & Miller, 1979; and the first and second quarterly
reports). A slight complication may arise at this point. Some investigators
attempting an integrative analysis have routinely transformed any
p value into its corresponding unit normal deviate z, then into a $\Delta$ or r.
The transformation via z introduces small errors into the resulting estimates;
when the particular test statistic on which p is based is known, then it is
more accurate to transform via that statistic. For example, in the
illustration above with p = .02 and $n_1 = n_2 = 6$, the transformation via z
(which essentially ignores the "degrees of freedom" problem) gives the
following estimate of $\Delta$:

$${}_{.99}z = 2.325$$

$$\Delta = 2.325\sqrt{\frac{1}{6} + \frac{1}{6}} = 1.34$$

The earlier estimate equaled 1.59; the error introduced by transforming via
z instead of t is over 15% of the value of $\Delta$.

Aside from this minor complication, the transformation of p values,
given n, into $\Delta$ or r is rather obvious, and it proceeds by means of
conventional statistical tables of significance levels and formulas
previously developed for transforming test statistics.
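
The two routes from a reported p value to an effect size can be compared in a short sketch. It assumes a two-group t-test with a two-tailed p value; the helper names (`t_quantile`, `delta_from_statistic`) are hypothetical, and the t quantile is obtained by numerical integration and bisection only to keep the example free of third-party libraries. The results should land close to the 1.59 and 1.34 worked above, with small differences due to rounding of the tabled values.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """CDF of Student's t for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(q, df):
    """Quantile of Student's t for q >= 0.5, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def delta_from_statistic(stat, n1, n2):
    """Effect size from a two-group statistic: stat * sqrt(1/n1 + 1/n2)."""
    return stat * math.sqrt(1 / n1 + 1 / n2)


p_two_tailed, n1, n2 = 0.02, 6, 6
q = 1 - p_two_tailed / 2                       # the .99 percentile
delta_via_t = delta_from_statistic(t_quantile(q, n1 + n2 - 2), n1, n2)
delta_via_z = delta_from_statistic(NormalDist().inv_cdf(q), n1, n2)
relative_error = (delta_via_t - delta_via_z) / delta_via_t
print(delta_via_t, delta_via_z, relative_error)
```

The relative error shrinks as the degrees of freedom grow, since the t distribution then approaches the normal.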
TRANSFORMING NON-PARAMETRIC STATISTICS

Suppose that a study involved the test of a null hypothesis about
equivalent locations of two distributions, and a Mann-Whitney U-test
was performed and reported. The U-test competes with a normal-distribution
t-test of means in these circumstances; the U-test was once
popular because it was believed to be safer when parametric assumptions
were violated. The safety proved largely illusory, and today the t-test
is the method of choice. But many studies reported U-test results,
and it is necessary to consider how information about $\Delta$, say, can be
retrieved from them.

No simple transformation of U into $\Delta$ is possible since the U-test
and most other non-parametric tests do not test simple hypotheses about
population means. However, one could substitute for the reported U-statistic
the value of t that has the equivalent level of significance. For example,
with $n_1 = n_2 = 10$, a U = 23 has a two-tailed significance level of p = .05.
The corresponding t is ${}_{.975}t_{18} = 2.10$. From this t-statistic a $\Delta$ is
found in the conventional manner:

$$\Delta = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 2.10\sqrt{\frac{1}{10} + \frac{1}{10}} = .939$$

The above series of transformations appears sensible and adequate,
but one refinement may be possible. Nonparametric tests are known to have
less power than parametric counterparts where the latter exist. Thus, a
U-statistic significant at the p = .05 level probably corresponds to a
t-statistic that is significant at the .03 or .02 level. For example,
it is known that in many circumstances the power of the U-test is about
95% as large as the power of the t-test, a situation illustrated below:




The area to the right of C under the curve $H_1$: t is $p_t$, the power of
the t-test against the particular alternative hypothesis illustrated. The
area above C under $H_1$: U is $p_U$, the power of the U-test. It is generally
true that $p_U/p_t = 3/\pi$ as $n \to \infty$ (Mood, 1954). Now suppose that $p_U$ is
approximately .94 in a particular situation. Then the corresponding power
of t is $p_U(\pi/3)$ = .94(1.0472) = .984. For large $n_1$ and $n_2$, the values of U
(appropriately standardized) and t that cut off 94% and 98.4% of the area
under roughly normal curves are 1.55 and 2.14. Hence, the small 5% difference
in power gives rise to quite large differences in test statistics
and, hence, in approximations of $\Delta$'s or r's. The prevalence and importance
of these differences depend on the relative powers of various non-parametric
and parametric tests.
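
The arithmetic of this power adjustment can be retraced with the standard library. The sketch below assumes the large-sample normal approximation used in the text; the variable names are illustrative only. Because the text rounds the power of t to .984 before looking up the deviate, the unrounded value computed here comes out slightly above 2.14.

```python
import math
from statistics import NormalDist

power_u = 0.94                          # assumed power of the U-test
power_t = power_u * math.pi / 3         # implied power of the t-test

std_normal = NormalDist()
stat_u = std_normal.inv_cdf(power_u)    # deviate cutting off 94% of the area
stat_t = std_normal.inv_cdf(power_t)    # deviate cutting off about 98.4%

print(round(power_t, 3), round(stat_u, 2), round(stat_t, 2))
```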

TRANSFORMING DICHOTOMOUS OUTCOME
VARIABLES INTO EFFECT SIZES

Experimental outcomes are frequently measured in crude dichotomies
where refined metric scales do not exist: dropped out vs.
persisted in school, remained sober vs. resumed drinking, convicted
vs. not convicted of a crime. It seems inappropriate with such data
to calculate means and standard deviations and take a conventional
ratio. One approach to this problem is to attempt to recover underlying
metric information. Suppose that on some underlying but unobservable
metric (e.g., motivation to stay in school), the experimental and
control groups are distributed normally as in Figure 5.4.
It is assumed that there exists a cut-off point, $C_x$, such that if
motivation to stay in school falls below $C_x$ the pupil will drop out. What can
be observed are the proportions $p_e$ and $p_c$ of the groups which fall below
$C_x$. Under the normal distributions assumption,

$$p_e = \int_{-\infty}^{z_e} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz, \qquad (39)$$

where

$$z_e = \frac{C_x - \mu_e}{\sigma_e}.$$

Clearly, $z_e$ is simply the standard normal deviate which divides the
curve at the $100p_e$th percentile and can be obtained from any table of the
normal curve. Likewise, $z_c$ is that value of the standard normal variable
which cuts off the bottom $100p_c$ percent of the distribution. Since

$$z_e = \frac{C_x - \mu_e}{\sigma_e} \quad \text{and} \quad z_c = \frac{C_x - \mu_c}{\sigma_c},$$

it can be shown under the assumption of homogeneous variances that

$$z_c - z_e = \frac{\mu_e - \mu_c}{\sigma} = \Delta.$$

[Figure 5.4: curves for the Control and Experimental groups on the underlying metric]

Thus, effect-size measures on hypothetical metric variables can
be obtained simply by differencing the standard normal deviates corresponding
to the percentages observed in the experimental and control groups. The
reasoning followed here is essentially the same as that which underlies
probit analysis in biometrics (see Finney, 1971). Where the unobservable
metric distributions ought to be assumed skewed in an expected direction,
the methods of logit transformation will be more appropriate (Ashton, 1972).
Table 5.6
Probit Transformation of Difference
In Proportions to Effect Size

      .05  .10  .15  .20  .25  .30  .35  .40  .45  .50  .55  .60  .65  .70  .75  .80  .85  .90  .95

.05     0  .36  .60  .80  .97 1.12 1.25 1.39 1.51 1.64 1.77 1.89 2.03 2.16 2.31 2.48 2.68 2.92 3.28
.10          0  .24  .44  .61  .76  .89 1.03 1.15 1.28 1.41 1.53 1.67 1.80 1.95 2.12 2.32 2.56 2.92
.15               0  .20  .37  .52  .65  .79  .91 1.04 1.17 1.29 1.43 1.56 1.71 1.88 2.08 2.32 2.68
.20                    0  .17  .32  .45  .59  .71  .84  .97 1.09 1.23 1.36 1.51 1.68 1.88 2.12 2.48
.25                         0  .15  .28  .42  .54  .67  .80  .92 1.06 1.19 1.34 1.51 1.71 1.95 2.31
.30                              0  .13  .27  .39  .52  .65  .77  .91 1.04 1.19 1.36 1.56 1.80 2.16
.35                                   0  .14  .26  .39  .52  .64  .78  .91 1.06 1.23 1.43 1.67 2.03
.40                                        0  .12  .25  .38  .50  .64  .77  .92 1.09 1.29 1.53 1.89
.45                                             0  .13  .26  .38  .52  .65  .80  .97 1.17 1.41 1.77
.50                                                  0  .13  .25  .39  .52  .67  .84 1.04 1.28 1.64
.55                                                       0  .12  .26  .39  .54  .71  .91 1.15 1.51
.60                                                            0  .14  .27  .42  .59  .79 1.03 1.39
.65                                                                 0  .13  .28  .45  .65  .89 1.25
.70                                                                      0  .15  .32  .52  .76 1.12
.75                                                                           0  .17  .37  .61  .97
.80                                                                                0  .20  .44  .80
.85                                                                                     0  .24  .60
.90                                                                                          0  .36
The transformation of dichotomous information to metric information
via probits or logits makes it possible to expand greatly the data base
of a meta-analysis. Frequently, studies on a single topic will encompass
both metric and dichotomous measurement of outcomes. Having to integrate
findings separately by type of outcome measurement is inconvenient as well
as less than the broadest, most comprehensive integration of research
possible.

Table 5.6 provides for the rapid calculation of $\Delta$ given $p_e$ and
$p_c$. For example, suppose that $p_e$ = .60 and $p_c$ = .40; from the table,
the value of $\Delta$ is found to be .50. Suppose, as a second illustration,
that $p_e$ = .35 and $p_c$ = .70. Then the sign of the effect size will be
reversed after referencing Table 5.6 with .70 for columns and .35 for
rows: -.91.
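
The table lookup can also be reproduced directly. The following is a minimal sketch, assuming that both proportions count "successes" so that a larger proportion in the experimental group means a positive effect; the function name is hypothetical. Exact normal deviates are used, so results may differ from the tabled entries in the second decimal place.

```python
from statistics import NormalDist


def probit_effect_size(p_e, p_c):
    """Difference of the standard normal deviates for two proportions."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(p_e) - std_normal.inv_cdf(p_c)


first = probit_effect_size(0.60, 0.40)     # first illustration in the text
second = probit_effect_size(0.35, 0.70)    # second illustration, sign reversed
print(round(first, 2), round(second, 2))
```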

Several minor technical problems have arisen in connection with
this technique: 1) what should be done when the distributions underlying
the dichotomies are not normal?, 2) what if the two distributions (that
giving rise to $p_e$ and that yielding $p_c$) have different variances?,
3) how does the probit transformation compare to treating the dichotomy
as an ordered metric and simply calculating $\Delta = (p_e - p_c)/\sqrt{p_c(1 - p_c)}$?,
4) how can a probit transformation be carried out when p equals either
zero or one?

Non-normality.

We have examined alternative underlying distributions that could
serve as a basis of a transformation method like probits. Two distributions
seem particularly useful: a) the logistic distribution, and b) the beta
distribution. Their probability density functions are as follows:

Logistic: $P(x) = \{\mathrm{sech}^2[(x - a)/2k]\}/4k$

Beta: $P(x) = [x^{v-1}(1 - x)^{w-1}]/B(v, w)$, where $B(v, w)$ is the beta
function.

The logistic curve has slightly "thicker tails" than the normal
distribution to recommend it; it is a symmetric curve, slightly more
peaked in the center and thinner in the intermediate regions than the
normal. The following comparison of ordinates makes these features clear:

                 Ordinate of
z-score      Normal      Logistic

  -4          .0001        .0013
  -3          .0044        .0078
  -2          .0540        .0458
  -1          .2420        .2185
   0          .3989        .4535

Although these differences in ordinates appear small, they yield large
differences in estimated effects when transformed first to percentiles then to
z-scores.
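
The ordinates in this comparison can be checked numerically. The sketch assumes the logistic curve is standardized to zero mean and unit variance, which corresponds to a = 0 and k = sqrt(3)/pi in the density given above; that scaling is an inference from the tabled values, not something stated in the text.

```python
import math


def normal_ordinate(z):
    """Height of the unit normal curve at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


def logistic_ordinate(z, a=0.0, k=math.sqrt(3) / math.pi):
    """Height of the logistic curve sech^2[(z - a)/2k] / 4k at z."""
    return (1 / math.cosh((z - a) / (2 * k))) ** 2 / (4 * k)


ordinates = {z: (normal_ordinate(z), logistic_ordinate(z)) for z in range(-4, 1)}
for z, (normal, logistic) in ordinates.items():
    print(z, round(normal, 4), round(logistic, 4))
```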



The beta distribution is a large family of curves bounded between 0 and 1
for the variate x and encompassing symmetric and asymmetric curves of widely varied
shapes. The beta distribution for v = 4 and w = 2 is depicted below.

Figure 5.5 Probability density function for the beta
variate β: v, w.

By changing v and w, the beta distribution can be given any desired
skewness. Thus, it is a useful distribution for describing asymmetric
variables. Furthermore, its percentiles have been extensively tabulated
(Pearson and Hartley, 196^). ' " 

We applied, where appropriate, probit transformations and metric
calculation of effect sizes on a body of literature in drug therapy and
psychotherapy. The discrepancy between the average effect sizes for the
two different methods proved to be relatively large, as Table 5.7 below
reveals.

It must be emphasized that the comparison in Table 5.7 is based on
two sets of data not necessarily equivalent in all important respects.
However, the direction of the difference (favoring the probit transformation
by nearly two-tenths standard deviation units) is consistent with the
expectation that violations of the normality assumption of the probit
method are likely to inflate effect-size estimates, particularly where
dichotomies are extreme (.95 vs. .05 or worse).
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Table 5.7 

Comparison of Average Effects Calculated by Either Probit
Transformation or Metric Statistics From 112
Experiments on Drug and Psychotherapy

                            No. of         Average
Method                      Δ's            Effect Size, Δ

Probit Transformation        53              .651
Metric Statistics           351              .494



Heterogeneous Variance. Suppose that one observes $p_e$ as the
proportion of cases exceeding some fixed point, C, on a scale of measurement
for which $X_e$ is normally distributed with mean and standard deviation
$\mu_e$ and $\sigma_e$. The quantity $p_c$ is similarly defined with $X_c$ having mean and
standard deviation $\mu_c$ and $\sigma_c$. Now if $p_e$ and $p_c$ are transformed into the
unit normal deviates, $z_e$ and $z_c$, that cut off the upper $100p_e$% and $100p_c$%
of the normal curve, then:

$$z_e = \frac{C - \mu_e}{\sigma_e} \quad \text{and} \quad z_c = \frac{C - \mu_c}{\sigma_c}.$$

It is easily shown that:

$$z_c - z_e\left(\frac{\sigma_e}{\sigma_c}\right) = \frac{\mu_e - \mu_c}{\sigma_c} = \Delta,$$

the mean difference standardized against the control group standard
deviation. If one knew the value of $\sigma_e/\sigma_c$ or had a good hunch about it,
then $\Delta$ could be easily calculated by weighting $z_e$ by the ratio $\sigma_e/\sigma_c$.
But it is more realistic (because $\sigma_e/\sigma_c$ will nearly always be unknown)
and important to ascertain how $\Delta$ is affected if $\sigma_e$ and $\sigma_c$ are unknown
and heterogeneous. Beginning with $z_c - z_e$ and permitting $\sigma_e$ and $\sigma_c$ to
differ, one quickly arrives at the expression:

$$z_c - z_e = \frac{\mu_e}{\sigma_e} - \frac{\mu_c}{\sigma_c} + \frac{C}{\sigma_c} - \frac{C}{\sigma_e} \qquad (40)$$

It is interesting to note that this expression depends on C, the
hypothetical cut-off point used in determining "success" in both the
experimental and control groups. The equation has not worked out to any
form that is particularly neat or useful. There is probably little point
in pursuing it much further. It is sufficient merely to record that
heterogeneous variances affect the probit transformation both through
their effect on the mean difference and the value of the criterion score.
One is advised to be alert to the possibility of unequal variances and to
use a transformation such as $z_c - z_e(\sigma_e/\sigma_c)$ when possible.
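
A small numerical check may make the role of the variance ratio concrete. All parameter values below are invented for illustration; the sketch builds the two proportions from known normal distributions and then recovers the effect size with and without the weighting by the ratio of standard deviations.

```python
from statistics import NormalDist

# Hypothetical population values (assumptions for illustration only).
mu_e, sigma_e = 0.6, 1.5      # experimental group
mu_c, sigma_c = 0.0, 1.0      # control group
C = 0.5                       # cut-off defining "success"

# Proportions exceeding C in each group.
p_e = 1 - NormalDist(mu_e, sigma_e).cdf(C)
p_c = 1 - NormalDist(mu_c, sigma_c).cdf(C)

# Unit normal deviates that cut off the upper 100p% of the curve.
std_normal = NormalDist()
z_e = std_normal.inv_cdf(1 - p_e)
z_c = std_normal.inv_cdf(1 - p_c)

unweighted = z_c - z_e                          # ignores unequal variances
weighted = z_c - z_e * (sigma_e / sigma_c)      # weights z_e by the ratio
true_delta = (mu_e - mu_c) / sigma_c
expression_40 = mu_e / sigma_e - mu_c / sigma_c + C / sigma_c - C / sigma_e

print(round(unweighted, 3), round(weighted, 3), true_delta)
```

With these values the unweighted difference understates the true effect, and it changes if C is moved, which is the dependence on the cut-off noted above.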

Probits vs. Dichotomous Variables. It has occurred to some to
ask whether the probit transformation of two dichotomies is roughly
equivalent to treating the dichotomies as merely a limiting case of an
effect size from the manifest variable, e.g.,

$$\Delta_d = \frac{p_e - p_c}{\sqrt{p_c(1 - p_c)}} \qquad (41)$$

This expression is simply the mean difference between the two
dichotomies standardized by the standard deviation of the control group.
The appropriate question to ask is how closely this formulation agrees
with the effect size calculated from the probit transformation, viz.,

$$\Delta_p = z_c - z_e, \quad \text{where}$$

$z_e$ is the unit normal deviate that marks off the upper $(100p_e)$% of the area
under the normal curve, and $z_c$ is similarly defined. The ratio of $\Delta_p$ to
$\Delta_d$ for various values of $p_e$ and $p_c$ is easily calculated. Values of the
ratio for $p_e$ and $p_c$ ranging from .1 to .9 in steps of .10 are tabulated below:

Values of the Ratio $\Delta_p/\Delta_d$

             p_e, Proportion of Successes in the Experimental Group
p_c       .1    .2    .3    .4    .5    .6    .7    .8    .9

 .1       --  1.32  1.14  1.04  0.96  0.92  0.90  0.91  0.96
 .2     1.76    --  1.27  1.20  1.12  1.10  1.09  1.12  1.21
 .3     1.74  1.45    --  1.29  1.20  1.19  1.20  1.25  1.38
 .4     1.70  1.47  1.38    --  1.19  1.24  1.27  1.34  1.49
 .5     1.60  1.40  1.31  1.22    --  1.27  1.31  1.40  1.60
 .6     1.50  1.34  1.27  1.19  1.24    --  1.33  1.44  1.68
 .7     1.33  1.25  1.20  1.17  1.20  1.24    --  1.45  1.74
 .8     1.21  1.12  1.09  1.09  1.12  1.18  1.27    --  1.76
 .9     0.96  0.91  0.90  0.92  0.96  1.03  1.14  1.32    --

p_c = proportion of successes in the control group.



These ratios are disconcertingly large in most cases. For example,
if $p_e$ = .20 and $p_c$ = .10, the effect size calculated from the probit
transformation is nearly one-third larger than the effect calculated
from treating the data as a manifest dichotomy. It seems clear that in
spite of the problems of non-normality and heterogeneous variances that
may plague the probit transformation, the calculation of effects from
dichotomies without consideration of underlying distributions is not an
acceptable alternative.
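
The ratio of the two effect-size definitions can be recomputed for any pair of proportions. The sketch below is an illustration with a hypothetical function name; it uses lower-tail normal quantiles, so the probit effect is written as the deviate for the experimental proportion minus that for the control proportion, which is the same quantity as the difference of upper-tail deviates taken in the opposite order.

```python
import math
from statistics import NormalDist


def probit_to_dichotomy_ratio(p_e, p_c):
    """Ratio of the probit effect size to the manifest-dichotomy effect size."""
    std_normal = NormalDist()
    delta_p = std_normal.inv_cdf(p_e) - std_normal.inv_cdf(p_c)
    delta_d = (p_e - p_c) / math.sqrt(p_c * (1 - p_c))
    return delta_p / delta_d


ratio_20_10 = probit_to_dichotomy_ratio(0.20, 0.10)   # the example in the text
ratio_10_20 = probit_to_dichotomy_ratio(0.10, 0.20)
print(round(ratio_20_10, 2), round(ratio_10_20, 2))
```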




Probits at the Extremes. A vexing problem with probit transformations
from dichotomous to metric data arises when n cases reveal either 0 or n
"successes." Then the proportion p = f/n equals either 0 or 1, and the
corresponding unit normal deviates are infinite ($-\infty$ and $+\infty$). Consider a
typical example. Ten experimental subjects are treated for dyslexia, and
at the end of six months each reads sufficiently well to be promoted
($p_e$ = 10/10 = 1). None of the ten control-group subjects is promoted ($p_c$ = 0/10 = 0).
The corresponding unit normal deviates are $z_e = +\infty$ and $z_c = -\infty$, and
$\Delta = \infty - (-\infty) = 2\infty$: Absurd. Suppose that it was decided arbitrarily to change
one case in each sample to avoid this problem. Then $p_e$ would be taken equal
to 9/10 and $p_c$ to 1/10. Now the unit normal deviates are 1.282 and -1.282,
respectively; and $\Delta$ = 2.564. Suppose a compromise between 0 and 1
"success" was struck at 0.5 so that $p_c$ equaled 0.5/10 = .05 and, similarly,
$p_e$ = .95. The resulting value of $\Delta$ is 1.645 - (-1.645) = 3.290. The
difference between 3.290 and 2.564 is too large to ignore; and the difference
of either from $\infty$ is too gruesome to contemplate. A method is needed
for dealing non-arbitrarily with p's of 1 or 0. One solution is afforded
by Bayesian statistics.

We shall assume that p is a sample estimate of $\pi$ where p = x/n and x is
binomially distributed. The Bayesian posterior distribution of $\pi$ is given by

$$\Pr(\pi \mid x) = \frac{\Pr(\pi)\,\Pr(x \mid \pi)}{\Pr(x)},$$

where $\Pr(\pi)$ is the prior distribution of $\pi$, assumed to be uniform on the
interval 0 to 1.

Now $\Pr(x)$ is given by:

$$\Pr(x) = \int_0^1 \Pr(\pi) \binom{n}{x} \pi^x (1 - \pi)^{n-x}\, d\pi. \qquad (42)$$

Since $\Pr(\pi)$ is a constant k, and recognizing that the terms in $\pi$
integrate to a Beta function, formula (42) becomes

$$\Pr(x) = k \binom{n}{x} B(x + 1,\, n - x + 1),$$

where $B(u, v) = [\Gamma(u)\,\Gamma(v)]/\Gamma(u + v)$, where

$$\Gamma(u) = (u - 1)! = (u - 1)(u - 2) \cdots 3 \cdot 2 \cdot 1$$

when u is an integer.

The distribution of x given $\pi$ is simply the binomial:

$$\Pr(x \mid \pi) = \binom{n}{x} \pi^x (1 - \pi)^{n-x}.$$

Thus, the posterior distribution of $\pi$ is given by:

$$\Pr(\pi \mid x) = \frac{k \binom{n}{x} \pi^x (1 - \pi)^{n-x}}{k \binom{n}{x} B(x + 1,\, n - x + 1)}.$$

The Bayesian estimate of $\pi$, denoted by $\hat{\pi}$, is the mean of the
posterior distribution:

$$E(\pi \mid x) = \hat{\pi} = \int_0^1 \pi \cdot \frac{\pi^x (1 - \pi)^{n-x}}{B(x + 1,\, n - x + 1)}\, d\pi
= \frac{B(x + 2,\, n - x + 1)}{B(x + 1,\, n - x + 1)}$$

$$= \frac{\Gamma(x + 2)\,\Gamma(n - x + 1)}{\Gamma(n + 3)} \cdot \frac{\Gamma(n + 2)}{\Gamma(x + 1)\,\Gamma(n - x + 1)} = \frac{x + 1}{n + 2}.$$

This result is the important one: assuming a uniform prior distribution
for $\pi$, the Bayesian estimate of $\pi$, the binomial parameter,
equals:

$$\hat{\pi} = \frac{X + 1}{n + 2},$$

where n is the sample size and X is the observed number
of successes. (Solutions are also possible for various non-uniform prior
distributions of $\pi$, especially the Beta distribution, for example.)

This result offers a non-arbitrary method of resolving difficulties
of probit transformation for the cases of p = 1 or 0. If X = 0 in a
binomial sample of n, then whereas p = 0, the Bayesian estimate $\hat{\pi}$ equals
(0 + 1)/(n + 2). Likewise, at the other end of the scale, a p of 1
corresponds to a $\hat{\pi}$ of (n + 1)/(n + 2). For example, in the illustration
discussed earlier, $p_e$ = 10/10 would yield $\hat{\pi}_e$ = 11/12 = .92 and $p_c$ = 0/10
would give $\hat{\pi}_c$ = 1/12 = .08. Hence $\Delta$ equals 1.40 - (-1.40) = 2.80. This
solution seems non-arbitrary and reasonable. Having found it, we see no
reason why $\hat{\pi}$ should not be applied across the board, that is, regardless
of the value of p = X/n; if a uniform prior distribution of $\pi$ is reasonable,
then p should be taken to equal $\hat{\pi}$ = (X + 1)/(n + 2).
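
The dyslexia illustration can be rerun with the Bayesian estimate in a few lines. This is a sketch with a hypothetical function name; it keeps the unrounded estimates 11/12 and 1/12, so the resulting effect size is slightly below the 2.80 obtained in the text from the rounded values .92 and .08.

```python
from statistics import NormalDist


def bayes_proportion(x, n):
    """Posterior mean of a binomial proportion under a uniform prior."""
    return (x + 1) / (n + 2)


pi_e = bayes_proportion(10, 10)    # ten of ten experimental subjects promoted
pi_c = bayes_proportion(0, 10)     # none of ten control subjects promoted

std_normal = NormalDist()
delta = std_normal.inv_cdf(pi_e) - std_normal.inv_cdf(pi_c)
print(round(pi_e, 2), round(pi_c, 2), round(delta, 2))
```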

An interesting problem arises when one's purposes are study integration.
Suppose that ten separate studies of five persons each yielded identical
results, one of five "successes." Each value of p would equal 1/5, and
the average of all the p's or the pooled value across the ten studies would
both equal .20. However, the average of the Bayesian estimates would be
$(\hat{\pi}_1 + \ldots + \hat{\pi}_{10})/10 = 10(2/7)/10 = .29$. The Bayesian correction in small
samples can be substantial, even though in a pooled sample it would be
insignificant, e.g., $\hat{\pi}_{pooled}$ = 11/52 = .21 vs. 10/50 = .20. Thus the
average of many small sample Bayesian estimates can be quite different from
a pooled Bayesian estimate. A pooled estimate would seem preferable, but
pooling obviates the examination of study-to-study variation in findings,
which is much in the spirit of our approach to integrating research.
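
The contrast between averaging small-sample Bayesian estimates and pooling first can be reproduced as follows. The list of studies simply encodes the hypothetical ten studies of five persons described above; the variable names are illustrative.

```python
from statistics import mean

# Ten hypothetical studies, each with 1 success among 5 persons.
studies = [(1, 5)] * 10

average_raw = mean(x / n for x, n in studies)
average_bayes = mean((x + 1) / (n + 2) for x, n in studies)

total_x = sum(x for x, _ in studies)
total_n = sum(n for _, n in studies)
pooled_raw = total_x / total_n
pooled_bayes = (total_x + 1) / (total_n + 2)

print(round(average_raw, 2), round(average_bayes, 2),
      round(pooled_raw, 2), round(pooled_bayes, 2))
```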



OUTCOMES OF CORRELATIONAL STUDIES 



In the meta-analysis of correlational studies, one is integrating
correlation coefficients descriptive of the relationship between two
variables, such as achievement and socioeconomic level, or teacher
personality and pupil learning. The quantitative description of findings
from correlational studies presents fewer complications than do
experimental studies.

Illustrations of the integrative analysis of correlational studies
will be drawn from a study of the relationship between pupils'
socioeconomic status (SES) and their academic achievement. White (1976)
collected over 600 correlation coefficients from published and unpublished
literature. The coefficients were analyzed to determine how their
magnitude was related to varying definitions of SES, different types of
achievement, age of the subjects, and so on. White found that the 636
available correlations of SES and achievement averaged .25 with a
standard deviation of about .20 and positive skew. Thus, the SES and achievement
correlation is below what is generally believed to be the strength of
association of the two variables. The correlation diminished as students
got older, r decreasing from about .25 at the primary grades to around
.15 late in high school. SES correlated higher with verbal than math
achievement (.24 vs. .19 for 174 and 128 coefficients, respectively).
When White classified the SES and achievement correlations by the type of
SES measure employed (see Table 5.8), SES measured as income correlated
more highly with achievement than either SES measured by the education of
the parents or the occupational level of the head of household.
Several reliable trends in the collection of 600 coefficients could help
methodologists designing studies and sociologists constructing models of
the schooling-social system.

It probably matters little whether analysis is carried out in the
metric of $r_{xy}$, $r_{xy}^2$ or Fisher's Z transformation of $r_{xy}$. The final results
ought to be expressed in terms of the familiar $r_{xy}$ scale, however.
There appears to be no good reason to transform $r_{xy}$ to Fisher's Z at the
intermediate stages of aggregation and analysis, though this is sometimes
recommended. Fisher's transformation was developed to solve an inferential
problem, and it would be an unlikely happenstance if it proved to be the
method of choice for combining correlation measures from several studies.
It is frequently recommended that two or more $r_{xy}$'s be squared, averaged,
and the square root taken rather than averaged directly. However, it is
fairly easy to show that the choice seldom makes a practical difference.
A little algebra applied to the ratio of $(r_1 + r_2)/2$ to $\sqrt{(r_1^2 + r_2^2)/2}$ will
show that the discrepancy between the two depends primarily on the size
of the difference between $r_1$ and $r_2$ and that they must be enormously
different for the two averaging methods to differ in any important way.
For example, the three coefficients .20, .30, and .40 average .30
directly; and they average .31 if first squared and averaged, and the
square root is taken. A gap of approximately more than .50 between $r_1$
and $r_2$ is needed to separate $(r_1 + r_2)/2$ and $\sqrt{(r_1^2 + r_2^2)/2}$ by more than
.05. The researcher can safely decide whether the scale of $r_{xy}$ or $r_{xy}^2$ is
more meaningful to him and work in that metric throughout an integration
of correlational studies.
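
The two ways of averaging correlations are easy to compare numerically. The sketch below uses the three coefficients from the example; the second pair of values is an invented illustration of how wide a gap is needed before the methods separate noticeably.

```python
import math


def direct_average(rs):
    """Arithmetic mean of the correlations."""
    return sum(rs) / len(rs)


def root_mean_square(rs):
    """Square each r, average, then take the square root."""
    return math.sqrt(sum(r * r for r in rs) / len(rs))


example = [0.20, 0.30, 0.40]
avg_direct = direct_average(example)
avg_rms = root_mean_square(example)

wide_pair = [0.00, 0.50]       # hypothetical pair separated by .50
wide_gap = root_mean_square(wide_pair) - direct_average(wide_pair)

print(round(avg_direct, 2), round(avg_rms, 2), round(wide_gap, 2))
```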

Table 5.8

Average Correlation between SES and Achievement for
Different Kinds of SES Measure

SES Measure                                    Average r*

Indicators of parents' income                  .315 ( 19)
Indicators of parents' education               .185 (116)
Indicators of parents' occupation level        .201 ( 65)

* Number of coefficients averaged in parentheses.

The correlational studies referred to here deal with ordinal,
metric variables. Correlational results which involve genuine dichotomies
or polychotomies (e.g., sex, ethnic group) should be recast into more
informative descriptive measures such as standardized differences among
means, and the techniques of "effect-size" measurement discussed above may
then be applied. Where the two variables correlated are conceived of as
having metric properties, even if the technology of measurement at the
time fell short of actual metric measurement, then one ought to seek to
transform all correlation measures to the scale of Pearson's product-moment
correlation coefficient.

When a large field of correlational research is collected, a
bewildering variety of statistics is encountered: biserial and point-biserial
correlation coefficients, rank-order correlations, phi coefficients,
contingency coefficients, contingency tables with chi square tests, t-tests,
analyses of variance, and more. In White's analysis of SES and achievement
correlation, a variety of methods of reporting what was basically a
correlational finding was encountered. Of 143 studies, 37 reported t or
F statistics, 71 reported Pearson r's, 8 reported chi square or
nonparametric statistics, and 27 presented only graphs or tables of means.

There usually is an algebraic path from the reported statistics to
a Pearson correlation coefficient or an approximation to one. Some signposts
along the paths are set out in Table 5.9, where it is indicated how one
might travel from particular forms of reported data to a product-moment
correlation measure.
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Table 5.9 



Guidelines for Converting Various Summary
Statistics Into Product-Moment Correlations

Each entry gives the reported statistic, the transformation to $r_{xy}$, and
the reference.

a) Point-biserial correlation, $r_{pb}$.
   Transformation: $r_{bis} = r_{pb}\sqrt{n_1 n_2}/(u\,n)$,* where
   u = ordinate of unit normal distribution and n = total sample size.
   Reference: Glass and Stanley (1970, p. 171).

b) t.
   Transformation: $r_{pb} = [t^2/(t^2 + n_1 + n_2 - 2)]^{1/2}$;
   then convert $r_{pb}$ to $r_{xy}$ via a) above.
   Reference: Glass and Stanley (1970, p. 313).

c) t based on extreme groups.
   Transformation: expressed in terms of
   n = within cell n,
   p = proportion cut at each end,
   u = ordinate on normal curve at the cut,
   z = standard normal deviate corresponding to p (abscissa value).
   Reference: Based on Feldt, Psychometrika, 1961, p. 315. Rearranged by
   Glass.

d) $F = MS_b/MS_w$ for J = 2 groups.
   Transformation: $\sqrt{F} = |t|$; then proceed via b) above.

e) $F = MS_b/MS_w$ for J > 2 groups.
   Transformation: i) Collapse J groups to 2 and then proceed via d)
   above, or ...
   Reference: Hays (1973, pp. 683-684).

f) $\chi^2$ only (i.e., no frequencies reported) for a contingency table.
   Transformation:** $r_{xy} \approx P = [\chi^2/(\chi^2 + n)]^{1/2}$, where
   n = total sample size.
   Reference: Kendall & Stuart (1957, p. 557 ff).

g) 2 x 2 contingency table.
   Transformation: Calculate tetrachoric $r_{xy}$ from tables.
   Reference: Glass and Stanley (1970, p. 165 ff).

h) R x C contingency table.
   Transformation: Collapse to a 2 x 2 table and proceed via g) above.

i) Spearman's rank correlation, $r_s$.
   Transformation: $r_{xy} \approx r_s$; the translation of $r_s$ to $r_{xy}$
   under bivariate normality is nearly a straight line.
   Reference: Kruskal (1958).

j) Mann-Whitney U.
   Transformation: Transform U to $r_{rb}$ (rank-biserial) via
   $r_{rb} = 1 - 2U/(n_1 n_2)$.
   Reference: Willson (1976).

*  A simple approximation of $r_{bis}$ from $r_{pb}$ is available for p's
   between .2 and .8 (Magnusson, p. 205).

** P is Pearson's coefficient of contingency and P approaches $\rho$ as the number of
   categories in the table increases. With few categories, the estimate can be
   unduly low.
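
Several of these signposts reduce to one-line formulas. The sketch below covers only entries a), b), f) and j); the function names are hypothetical, and the biserial conversion assumes the dichotomy splits the sample into groups of size n1 and n2 so that the ordinate u is taken at the point cutting off the proportion n1/n.

```python
import math
from statistics import NormalDist


def r_pb_from_t(t, n1, n2):
    """Entry b): point-biserial r from a two-group t."""
    return math.sqrt(t * t / (t * t + n1 + n2 - 2))


def r_bis_from_r_pb(r_pb, n1, n2):
    """Entry a): biserial r from point-biserial r."""
    n = n1 + n2
    std_normal = NormalDist()
    u = std_normal.pdf(std_normal.inv_cdf(n1 / n))   # ordinate at the cut
    return r_pb * math.sqrt(n1 * n2) / (u * n)


def contingency_from_chi2(chi2, n):
    """Entry f): Pearson's coefficient of contingency."""
    return math.sqrt(chi2 / (chi2 + n))


def rank_biserial_from_u(u_stat, n1, n2):
    """Entry j): rank-biserial r from the Mann-Whitney U."""
    return 1 - 2 * u_stat / (n1 * n2)


r_pb = r_pb_from_t(2.76, 6, 6)
r_bis = r_bis_from_r_pb(0.40, 50, 50)
p_coef = contingency_from_chi2(10.0, 90)
r_rb = rank_biserial_from_u(23, 10, 10)
print(round(r_pb, 3), round(r_bis, 3), round(p_coef, 3), round(r_rb, 2))
```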
Another common instance of transforming results involves converting a
correlation, $r_{xy}$, into a standardized mean difference. For example, Coleman's survey
of equality of educational opportunity reported a correlation coefficient between
class-size, X, and achievement, Y. But most other studies reported the relationship
in terms of means and variances on achievement for particular class-sizes, leading
to the measure $\Delta_{S-L}$ described in the first section of this report. Knowing only
$r_{xy}$, and $\bar{X}$ and $s_x$, the measure $\Delta_{S-L}$ can be calculated assuming a normal distribution
of X and a linear relationship of Y and X. Values of S and L must be specified on
X; they can be arbitrarily designated as any two convenient percentiles, e.g., P
and 100-P. Then $S = \bar{X} - z s_x$ and $L = \bar{X} + z s_x$, where z is the unit normal deviate
at the percentile 100-P.

Given $r_{xy}$, calculate the regression line of Y on X from

$$b_{yx} = r_{xy}(s_y/s_x); \quad \text{and}$$

$$b_0 = \bar{Y} - b_{yx}\bar{X}.$$

The mean values of Y corresponding to S and L are calculated by substitution
into the regression equation. The within group variance on Y is simply the
variance error of estimate, known to equal $s_y^2(1 - r_{xy}^2)$. Combining these facts leads
to

$$\Delta_{S-L} = \frac{2 z\, r_{xy}}{\sqrt{1 - r_{xy}^2}},$$

where z is the unit normal deviate at the Pth percentile of the normal curve (S being at
the Pth percentile in the distribution of X and L being at the 100-Pth percentile
of X).

The above conversion seems unobjectionable, and surely is provided that X
is roughly normally distributed and the regression of Y on X is linear. However,
where Y has a curvilinear regression on X, the value of $\Delta_{S-L}$ will be somewhat in
error.
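
The conversion from a correlation to a small-versus-large class comparison can be traced step by step. In this sketch the means, standard deviations, correlation and percentile are all invented values, and the function name is hypothetical; the closed form is compared with the result of substituting S and L into the regression line.

```python
import math
from statistics import NormalDist


def delta_small_minus_large(r_xy, percentile):
    """Closed form: 2 z r / sqrt(1 - r^2), z at the Pth percentile."""
    z = NormalDist().inv_cdf(percentile / 100)
    return 2 * z * r_xy / math.sqrt(1 - r_xy ** 2)


# Hypothetical inputs: class size X, achievement Y.
r_xy, percentile = -0.10, 25
x_bar, s_x = 30.0, 5.0
y_bar, s_y = 50.0, 10.0

closed_form = delta_small_minus_large(r_xy, percentile)

# Step-by-step route through the regression line.
z = NormalDist().inv_cdf(percentile / 100)
small, large = x_bar + z * s_x, x_bar - z * s_x     # Pth and (100-P)th percentiles
b_yx = r_xy * (s_y / s_x)
b_0 = y_bar - b_yx * x_bar
y_small, y_large = b_0 + b_yx * small, b_0 + b_yx * large
within_sd = s_y * math.sqrt(1 - r_xy ** 2)
step_by_step = (y_small - y_large) / within_sd

print(round(closed_form, 4), round(step_by_step, 4))
```

The result does not depend on the means or standard deviations chosen, which is why the closed form needs only the correlation and the percentile.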



NONPARAMETRIC MEASURE OF
EXPERIMENTAL EFFECT

Kraemer and Andrews (1980) have recently devised a descriptive
measure of effect size that appears to have advantages over traditional
standardized mean difference measures. Their measure is based on
frequency statistics and the inverse normal transformation. The most
important property of the Kraemer-Andrews measure is that it is invariant
with respect to monotonic transformations of the dependent variable.
As this is written, it is too soon to evaluate the utility of this
new measure, but early reactions seem promising.
CHAPTER SIX

METHODS OF ANALYSIS

The analysis of data in a meta-analysis is properly approached as an instance of multivariate data analysis in which the studies are the units on which measurements are taken and the study characteristics (Chapter Four) and findings (Chapter Five) are the many variables. The point of having come this far in our treatment of meta-analysis is the belief that the import of many studies described in many ways cannot be grasped by the reader without the aid of techniques of arranging, ordering, relating; in short, without the help of statistical methods. Univariate description, frequency tabulations, correlations, linear model estimation, regression analysis, factor analysis, analysis of covariance, discriminant-function analysis: any of the methods of statistical analysis that have proved to be useful in extracting meaning from data are potentially useful in meta-analysis. One's attitude toward the data may be exploratory (Tukey, 1977) or confirmatory, descriptive or inferential; it doesn't matter. We are breaking no new ground here; we are merely illustrating the application of well-known statistical methods in a context in which researchers are prone to forget that they are as useful, indeed necessary, as in other familiar contexts.

In this chapter, we shall first deal briefly with the simple univariate descriptive analysis of study findings. Then we shall describe methods of examining the correlation of study findings and characteristics. Third, the estimation of treatment effects where study findings can be arranged in the manner of factorial experiments will be investigated. Fourth, attention will be given to the special possibilities of integrating study findings where both the independent and dependent variables are measured on quantitative scales. Fifth, problems of statistical inference as they apply in meta-analysis will be discussed.

SIMPLE DESCRIPTION OF STUDY FINDINGS

Once the findings of the studies in a meta-analysis have been measured (whether by means of an effect size, a correlation coefficient or whatever), all the standard methods of tabulating and describing statistics may be usefully applied: frequency distributions, averages, measures of variability, and the like. In this respect, we much prefer Tukey's (1977) innovative and ingenious methods of exploratory data analysis to the unimaginative lot of techniques presented in most statistical methods textbooks. An illustration might help the reader understand our preference.

El-Nemr (1979) found 59 experimental studies in which traditional teaching of biology was compared with biology taught as a process of inquiry. These studies yielded nearly 250 effect size measures in which inquiry teaching was compared with traditional teaching of biology. The effect size measures fell into seven categories descriptive of type of outcome: science achievement, science process skills, critical thinking skills, laboratory skills, attitudes toward the biology course, interest in science, and "composite" (an average of the preceding outcomes). Plots of the characteristics of the distributions of effect sizes for each outcome category appear as Figure 6.1.

Consider the first category of outcomes in Figure 6.1. The 59 experiments yielded 39 effect sizes based on the measurement of achievement (since achievement was not measured in every experiment). Each effect size is of the form



$$\Delta = \frac{\bar{X}_I - \bar{X}_T}{s_x}$$



The distribution of the 39 achievement effect sizes is described by the lines, letters and dots above "Achievement" in Figure 6.1. The basic descriptive technique is the "box-and-whisker" plot with auxiliary features. The central box or rectangle marks off the "hinges" (roughly, the first and third quartiles) of the distribution of effect sizes and the median (ordinary definition). Half of the effect sizes lie between the top and the bottom of the box, with 25 percent of all the effect sizes inside the box on either side of the median. The hinges for the achievement effect sizes are at .02 and .23, approximately, and the median is at .17. The large black dot inside the box indicates the location of the average of the 39 effect sizes; for achievement, the mean is above the median. The dotted line emanating from both ends of the box measures the distance to the "inner fence," a distance arbitrarily chosen to be one-and-one-half times the length of the box (i.e., 150% of the hinge range). The lower-case letter f marks the inner fence. Data points that lie outside the inner fence are "outliers," and each is denoted by a small dot. At the same distance beyond the inner fence that the inner fence lies beyond the ends of the box, one marks off the "outer fence" with an upper-case F. Data points beyond the outer fence are "far outliers." One casts a suspicious eye at outliers and looks with even greater incredulity on far outliers. They may represent oddities (measurement reporting errors, misprints, miscalculations, and whatever) that ought to be eliminated or given different weight in describing the typical features of the data.

Figure 6.1. Summary statistics for effect sizes in seven classes of outcome from comparisons of inquiry vs. traditional teaching of biology. (After El-Nemr, 1979)
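The fence rules just described can be sketched in a few lines of Python. This is only an illustration: the function name `tukey_fences` and the effect sizes are invented for the example (they are not El-Nemr's data), and ordinary quartiles are assumed to stand in for Tukey's hinges, from which they differ slightly.

```python
import numpy as np

def tukey_fences(effect_sizes):
    """Summarize effect sizes with hinges, fences, outliers and far outliers."""
    x = np.sort(np.asarray(effect_sizes, dtype=float))
    # Assumption: the 25th and 75th percentiles approximate the hinges.
    lower_hinge, median, upper_hinge = np.percentile(x, [25, 50, 75])
    step = 1.5 * (upper_hinge - lower_hinge)          # 150% of the hinge range
    inner = (lower_hinge - step, upper_hinge + step)  # fences marked "f"
    outer = (lower_hinge - 2 * step, upper_hinge + 2 * step)  # fences marked "F"
    beyond_inner = (x < inner[0]) | (x > inner[1])
    beyond_outer = (x < outer[0]) | (x > outer[1])
    return {
        "hinges": (lower_hinge, upper_hinge),
        "median": median,
        "mean": x.mean(),
        "inner_fence": inner,
        "outer_fence": outer,
        "outliers": x[beyond_inner & ~beyond_outer],
        "far_outliers": x[beyond_outer],
        "mean_without_far_outliers": x[~beyond_outer].mean(),
    }

# Hypothetical effect sizes, invented purely for illustration.
deltas = [-0.10, 0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.80, 1.50]
summary = tukey_fences(deltas)
print(summary["outliers"], summary["far_outliers"])
print(summary["mean"], summary["mean_without_far_outliers"])
```

Comparing `mean` with `mean_without_far_outliers` shows how strongly a single far outlier can pull the average, which is the reason for inspecting the plot before settling on a summary figure.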

Notice, for example, that among the 39 achievement effect sizes in Figure 6.1 there are four outliers and two far outliers. If the two far outliers are eliminated and the average effect size recalculated, the average drops from .20 to .10. The median drops a little, but less than the 50 percent drop for the mean. Consider the "Process Skills" outcome category. Here, a substantial discrepancy exists between the median and the mean, with the latter one and two-thirds times larger than the former. But the mean is probably distorted by the single far outlier of 3.0; removing this outlier drops the mean to .41. The mean remains above the median because of the positive skew in the data for process skills, shown by the fact that the median is far closer to the lower hinge than the upper. Generally the means are larger than the medians, except for "Critical Thinking" where the order is reversed. And although the inquiry approach to teaching biology was superior to traditional teaching in most respects, it was no better with respect to pupils' interest in science.

Correlating Study Characteristics and Findings

The next step beyond the simple description of study findings is the study of the relationship between study characteristics and findings. This second stage of analysis is addressed to such questions as whether the findings are homogeneous for all types of subject (e.g., persons) or whether they are positive for some types of subject and negative for others, whether the findings are strong when viewed with certain research methods (e.g., subjective outcome appraisals), whether the short-term findings differ substantially from the long-term results, and so forth. Any one of the many statistical techniques for studying the association or relationship between two variables may find useful application at this stage: contingency table analysis, regression analysis, correlation analysis with its many subspecies (e.g., Pearson's r, point-biserial or biserial correlation, curvilinear correlation). Since study findings will be measured on metric scales ($\Delta$, r, etc.), metric measures of relationship deriving from Pearson product-moment notions will be the most powerful and useful.

Consider an illustration. In their first meta-analysis of the effects of psychotherapy, Smith and Glass (1977) compiled several hundred effect size measures for nearly four hundred controlled outcome evaluations. Among the characteristics of the studies coded were the following:





     Characteristics                                    Coding

 1)  Organization of therapy                            1 = individual, 2 = group.
 2)  Duration of therapy                                No. of hours.
 3)  Years experience of therapist                      No. of years.
 4)  Client diagnosis                                   1 = psychotic, 2 = neurotic.
 5)  IQ of clients                                      1 = low, 2 = medium, 3 = high.
 6)  Age of clients                                     Age in years.
 7)  Social-economic-cultural similarity                1 = very similar, . . . ,
     of therapist & clients                             4 = very dissimilar.
 8)  Internal validity of study                         1 = high, 2 = medium, 3 = low.
 9)  Date of publication of study                       Year.
10)  "Reactivity" of outcome measure                    1 = low, 2 = low ave., 3 = ave.,
                                                        4 = high ave., 5 = high.
11)  No. of months after therapy of                     No. of months.
     outcome measurement



Each of the eleven study characteristics was correlated with the effect size. The linear correlation coefficients obtained are reported in Table 6.1.

Table 6.1

Correlations of Several Descriptive Variables with Effect Size

Variable                                                         Correlation with effect size

Organization (1 = individual; 2 = group)                         -.07
Duration of therapy (in hours)                                   -.02
Years' experience of therapists                                  [illegible]
Diagnosis of clients (1 = psychotic; 2 = neurotic)                .02
IQ of clients (1 = low; 2 = medium; 3 = high)                     .15*
Age of clients                                                    .02
Similarity of therapists and clients
  (1 = very similar; . . . ; 4 = very dissimilar)                -.19*
Internal validity of study (1 = high; 2 = medium; 3 = low)       [illegible]
Date of publication                                              [illegible]
"Reactivity" of outcome measure (1 = low; . . . ; 5 = high)      [illegible]
No. of months posttherapy for follow-up                          -.10*

*  p < .05.
** p < .01.

The correlations are generally low, although several are reliably non-zero. Some of the more interesting correlations show a positive relationship between an estimate of the intelligence of the group of clients and the effect of therapy, and a somewhat larger correlation indicating that therapists who resemble their clients in ethnic group, age, and social level get better results. The effect sizes diminish across time after therapy as shown by the last correlation in Table 6.1, a correlation of -.10 which is closer to -.20 when the curvilinearity of the relationship is taken into account. The largest correlation is with the "reactivity" or subjectivity of the outcome measure. The multiple correlation of the eleven study characteristics with the effect size was equal to about .50; thus, 25 percent of the variance in study findings can be accounted for by variations in the characteristics of the studies. There is not space here to pause and consider the many implications of the relationships reported in Table 6.1; in this example, they are numerous, and they have not escaped either those who comment on the benefits of psychotherapy or those who concern themselves with the methodology of its evaluation (see Chapter Seven for further discussion of this point).
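The two quantities used in this passage, the simple correlation of each characteristic with effect size and the multiple correlation of all characteristics taken together, can be sketched as follows. The data are simulated under an invented model and the three characteristics are stand-ins chosen for the example; nothing here reproduces the Smith and Glass values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies = 200

# Hypothetical coded characteristics (assumptions for illustration only).
reactivity = rng.integers(1, 6, n_studies)   # coded 1-5
months = rng.integers(0, 25, n_studies)      # months of follow-up
iq = rng.integers(1, 4, n_studies)           # coded 1-3

# Invented generating model so that there is a relationship to recover.
delta = (0.3 + 0.15 * reactivity - 0.01 * months + 0.10 * iq
         + rng.normal(0.0, 0.6, n_studies))

characteristics = np.column_stack([reactivity, months, iq]).astype(float)

# Simple (Pearson) correlation of each characteristic with effect size.
simple_r = [np.corrcoef(col, delta)[0, 1] for col in characteristics.T]

# Multiple correlation: correlate effect sizes with their least-squares fit.
X = np.column_stack([np.ones(n_studies), characteristics])
coef, *_ = np.linalg.lstsq(X, delta, rcond=None)
multiple_R = np.corrcoef(X @ coef, delta)[0, 1]
variance_explained = multiple_R ** 2   # e.g. R = .50 gives .25

print(np.round(simple_r, 2), round(multiple_R, 2), round(variance_explained, 2))
```

Squaring the multiple correlation gives the proportion of variance accounted for, which is how a multiple R of about .50 translates into 25 percent of the variance.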

A more controversial use of the relationships of study characteristics to findings involves the attempt to equate various classes of studies and then observe comparative results. Imagine a simple hypothetical example. Either medication or hypnotherapy can be prescribed for asthmatic children. A set of 50 controlled experiments on the effects of medication shows an average effect size of .75; 60 experiments with hypnotherapy give an average effect size of .40. It is observed, however, that on the average the medication experiments measured effects one month after treatment whereas the hypnotherapy experiments measured outcomes at six months. Furthermore, within each class of experiment, the regression coefficient of $\Delta$ onto "follow-up time" is about the same:

Medication: $\Delta = .83 - .08$ (No. of months)

Hypnotherapy: $\Delta = .65 - .08$ (No. of months)

If the effects of both treatments are estimated for follow-up times of one month, the .35 difference in the uncorrected average comparison (.35 = .75 - .40) shrinks to .75 - .57 = .18 standard deviation units difference between the means of the treatment and control groups. If the regression of effect onto follow-up time were heterogeneous in the regression slopes between the two therapies, the estimated order of superiority could change from one follow-up time to another.
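The arithmetic of this hypothetical example can be checked directly. The sketch below evaluates both regression lines at a common follow-up time of one month; the helper name `predicted_effect` is an assumption made for the example. The second part uses an invented hypnotherapy slope of .02 (not part of the example in the text) merely to show how unequal slopes can reverse the order of superiority.

```python
def predicted_effect(intercept, slope_per_month, months):
    """Effect size predicted by a within-class regression on follow-up time."""
    return intercept - slope_per_month * months

# Uncorrected comparison of the two average effect sizes.
uncorrected_gap = 0.75 - 0.40

# Both treatments estimated at a follow-up time of one month.
medication_1mo = predicted_effect(0.83, 0.08, 1)       # about .75
hypnotherapy_1mo = predicted_effect(0.65, 0.08, 1)     # about .57
gap_at_one_month = medication_1mo - hypnotherapy_1mo   # about .18

def order_at(months, hypnotherapy_slope=0.02):
    """Which treatment is ahead when the slopes differ (slope .02 is invented)."""
    med = predicted_effect(0.83, 0.08, months)
    hyp = predicted_effect(0.65, hypnotherapy_slope, months)
    return "medication" if med > hyp else "hypnotherapy"

print(round(uncorrected_gap, 2), round(gap_at_one_month, 2))
print(order_at(1), order_at(6))
```

With equal slopes the adjusted gap is the same at every follow-up time; with unequal slopes the two lines cross, so the answer depends on where the comparison is made.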

In our analysis of psychotherapy effects, the regression of effect size onto ten independent variables was performed separately within three quite different classes of psychotherapy: psychodynamic, systematic desensitization, and behavior modification. The results of the three multiple regression analyses appear in Table 6.2.



Table 6.2

Regression Analyses Within Therapies

Unstandardized regression coefficients

Psychodynamic

Systematic desensitization

Behavior modification
(n = 129)


.174 


-.193 


.041 


-.114, 


2Q\ 


20\ 


.002 


-.002 


.002 


-.011 


-.034 


-.018 


-.015 


.004 


-.033 


-.rii 


J87 


- 015 


. .182 


.088 " 


- 163 


.108 


-.086 


-J7^ 


-J)31 > 


-.047 


.007 




.025 - 


.021 


0 


.489 


453 


^423 


.512 


.509 


.173 


^ -386 


MO 



Diagnosis (1 = psychotic; 2 = neurotic)
Intelligence (1 = low; . . . ; 3 = high)
Transformed age (a)

Experience of Therapist X Neurotic
Experience of Therapist X Psychotic
Clients self-presented
Clients solicited

Organization (1 = individual; 2 = group)
Transformed months posttherapy (b)
Transformed reactivity of measure (c)
Additive constant
Multiple R



(a) Transformed age = (Age - 25)(|Age - 25|) [exponent illegible].
(b) Transformed months posttherapy = (No. months) [exponent illegible].

(c) Transformed reactivity of measure = (Reactivity) [exponent illegible].



Relatively complex forms of the independent variables were used to account for interactions and nonlinear relationships. For example, years' experience of the therapist bore a slight curvilinear relationship with outcome, probably because more experienced therapists worked with more seriously ill clients. This situation was accommodated by entering, as an independent variable, "therapist experience" in interaction with "diagnosis of the client." Age of client and follow-up date were slightly curvilinearly related to outcome in ways most directly handled by changing exponents. These regression equations allow estimation of the effect size a study shows when undertaken with a certain type of client, with a therapist of a certain level of experience, etc. By setting the independent variables at a particular set of values, one can estimate what a study of that type would reveal under each of the three types of therapy. Thus, a statistically controlled comparison of the effects of psychodynamic, systematic desensitization, and behavior modification therapies can be obtained in this case. The three regression equations are clearly not homogeneous; hence, one therapy might be superior under one set of circumstances and a different therapy superior under others. A full description of the nature of this interaction is elusive, though one can illustrate it at various particularly interesting points.

In Figure 6.2 estimates are made of the effect sizes that would be shown for studies in which simple phobias of high-intelligence subjects, 20 years of age, are treated by a therapist with 2 years' experience and evaluated immediately after therapy with highly subjective outcome measures.

ESTIMATED EFFECT SIZES

PSYCHODYNAMIC                 0.919
SYSTEMATIC DESENSITIZATION    1.049
BEHAVIORAL MODIFICATION       1.119

Figure 6.2. Three within-therapy regression equations set to describe a prototypic therapy client (phobic) and therapy situation.

This verbal description of circumstances can be translated into quantitative values for the independent variables in Table 6.2 and substituted into each of the three regression equations. In this instance, the two behavioral therapies show effects superior to the psychodynamic therapy.

In Figure 6.3 a second prototypical psychotherapy client and situation are captured in the independent variable values, and the effects of the three types of therapy are estimated. For the typical 30-year-old neurotic of average IQ seen in circumstances like those that prevail in mental health clinics (individual therapy by a therapist with 5 years' experience), behavior modification is estimated to be superior to psychodynamic therapy, which is in turn superior to systematic desensitization at the 6-month follow-up point.

ESTIMATED EFFECT SIZES

PSYCHODYNAMIC                 [illegible]
SYSTEMATIC DESENSITIZATION    0.516
BEHAVIORAL MODIFICATION       [illegible]

Figure 6.3. Three within-therapy regression equations set to describe a prototypic therapy client (neurotic) and therapy situation.

Besides illuminating the relationships in the data, the quantitative techniques described here can give direction to future research. By fitting regression equations to the relationship between effect size and the independent variables descriptive of the studies and then by placing confidence regions around these hyperplanes, the regions where the input-output relationships are most poorly determined can be identified. By concentrating new studies in these regions, one can avoid the accumulation of redundant studies of convenience that overelaborate small areas.

Linear ANOVA Models for Estimation
of Effects



Collections of experiments often present odd arrays of comparison to one who wishes an integrated summary of effects. For example, an integration of reading instruction research would encounter experiments comparing Initial Teaching Alphabet (ITA) and Traditional Orthography (TO), other experiments comparing ITA and Diacritical Marking (DM), and still a third type of experiment in which TO and DM are compared. For each comparison, a standardized mean contrast can be calculated (e.g., $\Delta = (\bar{X}_{ITA} - \bar{X}_{TO})/s_x$). The integration of these various $\Delta$'s into an estimation of the effects of the three individual instructional methods is not immediately obvious. One fruitful approach is via "effects coding" and the general linear model. For example, the following model can be postulated:

$$\Delta = \beta_{ITA} X_1 + \beta_{TO} X_2 + \beta_{DM} X_3 + e. \qquad (1)$$

The variables $X_1$, $X_2$ and $X_3$ take on the values 1, 0, and -1.

If, for example, a particular $\Delta$ is based on an experimental comparison of ITA and TO, then $X_1 = 1$, $X_2 = -1$ and $X_3 = 0$. In this way, many $\Delta$'s can be regressed onto the X's, and the $\beta$'s, which are individual effects of the instructional methods, can be estimated.
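A minimal sketch of this effects-coding regression follows. The comparison list and its contrast values are invented for illustration, and the helper `effects_row` is hypothetical. One point deserves care: every row of the design matrix sums to zero, so the three effects are determined only up to a common additive constant. The sketch assumes the minimum-norm least-squares solution returned by `numpy.linalg.lstsq`, which is the solution whose effects sum to zero.

```python
import numpy as np

methods = ["ITA", "TO", "DM"]

def effects_row(first, second):
    """Design-matrix row for a contrast 'first minus second': +1, -1, else 0."""
    row = np.zeros(len(methods))
    row[methods.index(first)] = 1.0
    row[methods.index(second)] = -1.0
    return row

# Hypothetical comparisons: (first method, second method, standardized contrast).
comparisons = [
    ("ITA", "TO", 0.30),
    ("ITA", "TO", 0.20),
    ("ITA", "DM", 0.10),
    ("ITA", "DM", 0.20),
    ("TO", "DM", -0.15),
    ("TO", "DM", -0.05),
]

X = np.array([effects_row(a, b) for a, b, _ in comparisons])
delta = np.array([d for _, _, d in comparisons])

# X has rank 2, so lstsq returns the minimum-norm solution (effects sum to zero).
beta, residuals, rank, _ = np.linalg.lstsq(X, delta, rcond=None)

for name, b in zip(methods, beta):
    print(f"{name}: {b:+.3f}")
```

Only differences between the estimated effects are interpretable here; fixing one method at zero instead of imposing a zero sum would shift all three estimates by the same constant.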

The technique of "control referencing" that was dealt with briefly in Chapter Five can be approached more conveniently through use of the linear effects models of this section. Suppose, for example, that there exist n experiments in which treatment A is compared to a control group, n experiments in which B is compared with a control group and n experiments in which A and B are compared directly without a control group. There are, thus, three types of effect size measure: $\Delta_A$, $\Delta_B$ and $\Delta_{A-B}$. A simple modification of the general linear model like that in (1) above suffices to describe the effects:

$$\Delta = \beta_A X_1 + \beta_B X_2 + e, \qquad (2)$$

where

$X_1 = 1$ and $X_2 = 0$ if $\Delta$ is of the form A vs. Control,
$X_1 = 0$ and $X_2 = 1$ if $\Delta$ is of the form B vs. Control,
$X_1 = +1$ and $X_2 = -1$ if $\Delta$ is of the form A vs. B.

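For the equal n's example, the data, the design and the parameter matrices are as follows:

$$
\begin{bmatrix} \Delta_A \\ \vdots \\ \Delta_B \\ \vdots \\ \Delta_{A-B} \\ \vdots \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ \vdots & \vdots \\ 0 & 1 \\ \vdots & \vdots \\ 1 & -1 \\ \vdots & \vdots \end{bmatrix}
\begin{bmatrix} \beta_A \\ \beta_B \end{bmatrix}
+ e
$$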



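Denoting the design matrix by X, the least-squares estimates of the effect parameters are given by

$$\hat{\beta} = (X^T X)^{-1} X^T \Delta.$$

The forms of $(X^T X)^{-1}$ and $X^T \Delta$ are as follows:

$$
(X^T X)^{-1} = \frac{1}{n}
\begin{bmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{bmatrix},
\qquad
X^T \Delta = n
\begin{bmatrix} \bar{\Delta}_A + \bar{\Delta}_{A-B} \\ \bar{\Delta}_B - \bar{\Delta}_{A-B} \end{bmatrix}.
$$

Therefore, the estimates of the aggregate effect sizes for treatments A and B are given by

$$
\hat{\beta}_A = \tfrac{1}{3}\left(2\bar{\Delta}_A + \bar{\Delta}_B + \bar{\Delta}_{A-B}\right),
\qquad
\hat{\beta}_B = \tfrac{1}{3}\left(\bar{\Delta}_A + 2\bar{\Delta}_B - \bar{\Delta}_{A-B}\right),
$$

where the bar above the delta indicates simple average.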
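The closed-form estimates can be checked numerically against a direct least-squares solution. The effect sizes below are invented, and n = 4 experiments per type is an arbitrary choice for the sketch.

```python
import numpy as np

n = 4  # experiments of each type (arbitrary, for illustration)

# Hypothetical effect sizes for the three types of experiment.
d_A = np.array([0.60, 0.50, 0.70, 0.60])    # A vs. Control
d_B = np.array([0.30, 0.20, 0.40, 0.30])    # B vs. Control
d_AB = np.array([0.20, 0.30, 0.40, 0.30])   # A vs. B

# Design matrix: rows (1, 0), (0, 1) and (1, -1), each repeated n times.
X = np.vstack([
    np.tile([1.0, 0.0], (n, 1)),
    np.tile([0.0, 1.0], (n, 1)),
    np.tile([1.0, -1.0], (n, 1)),
])
delta = np.concatenate([d_A, d_B, d_AB])

# Direct least-squares solution of the normal equations.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ delta)

# Closed-form expressions in terms of the three simple averages.
closed_A = (2 * d_A.mean() + d_B.mean() + d_AB.mean()) / 3
closed_B = (d_A.mean() + 2 * d_B.mean() - d_AB.mean()) / 3

print(beta_hat, closed_A, closed_B)
```

The direct comparisons of A with B thus contribute to both estimates, which is the sense in which every source of information about an effect is used.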



A related, but slightly more complex, problem involves treatment components which can be evaluated separately or in combination in experiments. Consider, for example, the treatment of psychological disorders by either drugs or psychotherapy or both.

The experimental literature on drug and psychotherapy addresses the estimation of the separate and interactive effects of drugs and psychotherapy in a variety of ways. The variety is a nuisance. Several types of experiments can be identified which inform one about the drug effect alone, or the drug plus the interaction effect, or the psychotherapy plus the drug plus the interaction effect, and so on in various combinations. An experiment that compares clients' progress under drugs with a group of clients receiving a placebo or nothing estimates the simple drug effect, whereas an experiment that compares two groups of clients, one of which receives drugs-plus-psychotherapy and the other of which receives only drugs, provides an estimate of the psychotherapy plus the interaction effect, since one group has the possible advantage of the separate psychotherapy effect and any benefits that result from combining drugs and psychotherapy. Denote the drug effect in isolation when compared with a placebo or no treatment by $\delta$; denote the separate psychotherapy effect by $\psi$; and denote the interaction effect of the two by $\eta$. Then the comparison of drug therapy and placebo in an experiment estimates $\delta$. The comparison of drug-plus-psychotherapy with psychotherapy estimates $\delta + \eta$ because both sides of the comparison have equal psychotherapy effects. In Table 6.3 appear the possible experimental comparisons of drug and psychotherapy and what effects these comparisons estimate.

By arranging and averaging the results from experiments of the six different types specified in Table 6.3, the separate and interactive effects of drug and psychotherapy can be estimated. The organization of data and unknown parameters in Table 6.3 can be viewed as a system of six sources of information and three unknown parameters. Least-squares estimates of the parameters can be calculated by ordinary methods.



Table 6.3

The Structure of Experiments on the Effects of Drug and Psychotherapy

Treatments Compared in the Experiment            Effects Estimated by the Comparison

A. Drug vs. Placebo (or No Treatment)            $\delta$
B. Psychotherapy vs. Placebo                     $\psi$
C. (Drug & Psychotherapy) vs. Placebo            $\delta + \psi + \eta$
D. (Drug & Psychotherapy) vs. Drug               $\psi + \eta$
E. (Drug & Psychotherapy) vs. Psychotherapy      $\delta + \eta$
F. Drug vs. Psychotherapy                        $\delta - \psi$

If one wished to maintain a distinction between placebo and no-treatment control groups, there would be twelve lines in Table 6.3 instead of six and the structure of effects would change slightly; for example, a Drug vs. No-Treatment experiment would estimate the drug plus the placebo effect, since the expectancy effect of administering the drug to the experimental group would not be counterbalanced by an expectancy effect for the no-treatment control group.

In a meta-analysis of psychotherapy research, the question was addressed of the main and interactive effects of psychotherapy and drug therapy. A total of 112 studies was collected, each of which addressed the question in part with one or more experimental comparisons. These 112 studies yielded 566 effect-size measures (i.e., standardized mean differences). For example, for a study in which drug treatment was compared with combined drug and psychotherapy treatment, a standardized mean difference of the following form would result: $\Delta = (\bar{X}_{DP} - \bar{X}_{D})/s_x$. In Table 6.4 appear the actual average effect sizes calculated from the findings of the 112 experiments.

Table 6.4

Average Effect Sizes from Various Experimental Comparisons Made in the Experiments on Drug and Psychotherapy

Comparison                                          Parameter(s) Estimated     Average $\Delta$   No. of $\Delta$'s

Psychotherapy vs. No-Treatment or Placebo           $\psi$                     .30                55
Drug Therapy vs. No-Treatment or Placebo            $\delta$                   .51                351
Drug & Psychotherapy vs. Drug                       $\psi + \eta$              .41                10
Drug & Psychotherapy vs. Psychotherapy              $\delta + \eta$            .44                94
Drug vs. Psychotherapy                              $\delta - \psi$            .10                7
Drug & Psychotherapy vs. No-Treatment or Placebo    $\delta + \psi + \eta$     .65                49

Note. $\psi$ denotes the separate or "main" effect of psychotherapy; $\delta$ denotes the separate effect of drug therapy; and $\eta$ denotes their interaction.

As an example of how Table 6.4 can be interpreted, consider the first line of entries. A total of 55 comparisons in the 112 studies involved contrasting the scores of persons who received psychotherapy with those who received no treatment or, at most, a placebo. Such comparisons estimate the magnitude of the psychotherapy effect, $\psi$; the estimate equals .30, i.e., the psychotherapy groups averaged three-tenths standard deviation superior to the control groups on the outcome variables. Consider as a second example the 94 comparisons of drug-plus-psychotherapy with psychotherapy alone. Such comparisons estimate the separate drug effect, $\delta$, and the interactive effect, $\eta$, which results when drug and psychotherapy are combined in the same treatment. The psychotherapy effect, $\psi$, is not reflected in the contrast because it is present on both sides of the comparison. The 94 effect sizes which estimate $\delta + \eta$ have an average of .44. The remainder of the table can be understood in like manner.

From simple inspection, it appears that the drug effect of .51 is more than half again as large as the psychotherapy effect of .30. The interaction effect is slightly more difficult to comprehend from merely inspecting the entries in Table 6.4. That the drug-plus-psychotherapy vs. drug comparison, which estimates $\psi + \eta$, is a full one-tenth standard deviation larger than the .30 estimate of $\psi$ from the first line of the table might lead one to believe that $\eta$ is positive; but the comparison of the estimates of $\delta + \eta$ and $\delta$ (being .44 and .51, respectively) reverses this impression. Inspection is too arbitrary and confusing. Several comparisons in the table contain information about the same parameters; it seems reasonable that every source of information about a parameter should be used in estimating it. A complete and standard method of combining the data in Table 6.4 into estimates of the parameters is needed. Such a method is suggested when one recognizes that the two middle columns of Table 6.4 constitute a system of linear equations, three of them independent and containing three unknowns ($\psi$, $\delta$ and $\eta$). The method of least-squares statistical estimation can be applied to obtain estimates of the separate and interactive effects of drug and psychotherapy.

The data and parameters of Table 6.4 can be written as a set of simultaneous linear equations as follows:

$$
\begin{bmatrix} .30 \\ .51 \\ .41 \\ .44 \\ .10 \\ .65 \end{bmatrix}
=
\begin{bmatrix}
 1 & 0 & 0 \\
 0 & 1 & 0 \\
 1 & 0 & 1 \\
 0 & 1 & 1 \\
-1 & 1 & 0 \\
 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} \psi \\ \delta \\ \eta \end{bmatrix}
$$



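Denoting the vector of data by $\Delta$ and the design matrix by X, the solution for the parameter estimates is as follows:

$$
\begin{bmatrix} \hat{\psi} \\ \hat{\delta} \\ \hat{\eta} \end{bmatrix}
= (X^T X)^{-1} X^T \Delta, \qquad (3)
$$

where

$$
(X^T X)^{-1} =
\begin{bmatrix}
 1/2 & 1/4 & -1/2 \\
 1/4 & 1/2 & -1/2 \\
-1/2 & -1/2 & 1
\end{bmatrix}
\quad \text{and} \quad
X^T \Delta =
\begin{bmatrix} 1.26 \\ 1.70 \\ 1.50 \end{bmatrix}.
$$

Hence, the estimates of the parameters are found from $(X^T X)^{-1} X^T \Delta$ to be

$$
\begin{bmatrix} \hat{\psi} \\ \hat{\delta} \\ \hat{\eta} \end{bmatrix}
=
\begin{bmatrix} .31 \\ .42 \\ .02 \end{bmatrix}.
$$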
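The matrices in (3) can be reproduced from the six averages of Table 6.4. The sketch below is an unweighted fit in which each of the six averages counts equally, regardless of how many effect sizes lie behind it; weighting by the number of effect sizes would be a different analysis and is not attempted here. Variable names are chosen for the example.

```python
import numpy as np

# Columns are (psi, delta, eta); rows follow the order of Table 6.4.
X = np.array([
    [ 1, 0, 0],   # Psychotherapy vs. control:            psi
    [ 0, 1, 0],   # Drug vs. control:                     delta
    [ 1, 0, 1],   # Drug & Psychotherapy vs. Drug:        psi + eta
    [ 0, 1, 1],   # Drug & Psychotherapy vs. Psychother.: delta + eta
    [-1, 1, 0],   # Drug vs. Psychotherapy:               delta - psi
    [ 1, 1, 1],   # Drug & Psychotherapy vs. control:     delta + psi + eta
], dtype=float)
avg_effect = np.array([0.30, 0.51, 0.41, 0.44, 0.10, 0.65])

XtX_inv = np.linalg.inv(X.T @ X)
Xt_delta = X.T @ avg_effect
psi_hat, delta_hat, eta_hat = XtX_inv @ Xt_delta

print(np.round(XtX_inv, 2))
print(np.round(Xt_delta, 2))
print(psi_hat, delta_hat, eta_hat)   # approximately .305, .415, .02
```

The unrounded values (.305 and .415) round to the .31 and .42 reported for the psychotherapy and drug effects.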



Each effect is expressed on a scale of standard deviation units. Thus, the data of Table 6.4 lead to the conclusion that with the groups of clients studied psychotherapy produces outcomes that are about one-third standard deviation superior to the outcomes from placebo or untreated control groups. The drug effect is only about a third greater than the psychotherapy effect. An effect of .31 will move an average client from the middle of the control group distribution to about the 62nd percentile; an effect of .42 would move the average client to only about the 66th percentile.

INTEGRATING STUDIES THAT HAVE
QUANTITATIVE INDEPENDENT VARIABLES

Many bodies of research literature involve the examination of the relationship between dependent and independent variables, both described quantitatively. Where the quantitative character of the independent variable can be preserved, the gain in precision of the integration of findings can be considerable. Examples of problems where this is true include class-size and achievement, the duration of effects of any treatment, study time and achievement, and countless laboratory problems in the social sciences. Consider, for example, a research integration problem faced by Underwood (1957) in his work on memory. Over fifteen studies were available to him addressed to the question of the efficiency of recall as a function of the ordinal position of the items to be recalled in a series of lists. Underwood plotted the curve reproduced below as Figure 6.4 and concluded that efficiency of recall was largely a function of interference from items previously memorized.

Figure 6.4. Recall as a function of previous lists learned as determined from a number of studies. (Horizontal axis: number of previous lists.)

The curve in Figure 6.4 represents a simple problem in research integration; it could be fit adequately with a logarithmic curve or many other alternatives to a straight line. But the problems presented by many other quantitative independent and dependent variables are more complex. Consider the relationship between class-size and educational achievement.

A Modification of Multiple Linear Regression

A simple statistic is desired that describes the relationship between class-size and achievement as determined by a study. No matter how many class-sizes are compared, the data can be reduced to some number of paired comparisons, a smaller class against a larger class. Certain differences in the findings must be attended to if the findings are later to be integrated. The most obvious differences involve the actual sizes of "smaller" and "larger" classes and the scale properties of the achievement measure. The actual class-sizes compared must be preserved and become an essential part of the descriptive measure. The measurement scale properties can be handled by standardizing all mean differences in achievement by dividing by the within-group standard deviation (a method that is complete and discards no information at all under the assumption of normal distributions). The eventual measure of relationship seems straightforward and unobjectionable:



$$\Delta_{S-L} = \frac{\bar{X}_S - \bar{X}_L}{\hat{\sigma}}$$

where

$\bar{X}_S$ is the estimated mean achievement of the smaller class, which contains S pupils;

$\bar{X}_L$ is the estimated mean achievement of the larger class, which contains L pupils; and

$\hat{\sigma}$ is the estimated within-class standard deviation, assumed to be homogeneous across the two classes.

As a first approximation to studying the class-size and achievement rela- 
tionship, it is considered irrelevant that the particular types of achievement 
^ that lie behind the variable, X are quite different knowledges and skills measured 
in quite different ways. 

If distributional assumptions about $X$ are needed to add meaning to particular
values of $\Delta_{S-L}$, normality will be assumed. For example, suppose $\Delta_{S-L} = +1$.
Then, assuming normal distributions within classes, the average pupil in the smaller
class scores at the 84th percentile of the larger class. These interpretations
are occasionally helpful, but seldom critical, and our investment in the normality
assumption is not great. It would be no surprise nor any concern if the assumption
proved to be more or less wrong, and it's probably not far off in most instances.
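
The standardized difference and its percentile reading can be sketched in a few lines. This is only an illustration: the function names and the class means and standard deviation below are hypothetical, chosen so that the difference equals +1.

```python
from statistics import NormalDist


def standardized_difference(mean_small, mean_large, sd_within):
    """Smaller-class mean minus larger-class mean, in within-class SD units."""
    return (mean_small - mean_large) / sd_within


def percentile_in_larger_class(delta):
    """Percentile of the larger-class distribution reached by the average
    pupil of the smaller class, assuming normality and a common SD."""
    return 100.0 * NormalDist().cdf(delta)


# Hypothetical values (not from any study in the report).
delta = standardized_difference(mean_small=60.0, mean_large=50.0, sd_within=10.0)
print(delta, round(percentile_in_larger_class(delta)))
```

With a difference of +1 the percentile rounds to 84, the figure quoted in the text.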

There exist several alternative statistical techniques for integrating a
large set of $\Delta_{S-L}$'s so as to describe the aggregated findings on the class-size
and achievement relationship. A large, square matrix could be constructed in
which the rows and columns are class-sizes and the cell entries are average
values of $\Delta_{S-L}$; nearly equal values of average deltas could be connected by lines
to form "iso-deltas" in much the manner as economic equilibrium curves are used
to depict three-variable relationships. Or a variation of psychometric scaling
could be employed: a square matrix of class-sizes could be constructed for
which each cell entry would be the proportion of times the row class-size gave
achievement greater than the column class-size. This matrix could be scaled by
means of Thurstone's Law of Comparative Judgment, which would locate the class-sizes
along an achievement continuum. (This method was used and the results
were reasonably satisfactory.) Finally, regression equations could be constructed
in which $\Delta_{S-L}$ is partitioned into a weighted linear combination of $S$
and $L$ and functions thereof and error. There is much to recommend this latter
procedure, and the technique eventually employed is a variation of it. But the
regression of $\Delta_{S-L}$ onto only $S$ and $L$ requires three dimensions to be depicted.
Anything more complex than a simple two-dimensional curve relating achievement
to the size of class was considered undesirably complicated and beyond the easy
reach of most audiences who held a stake in the results.

The desire to depict the aggregate relationship as a single-line curve is
confounded with the problem of essential inconsistencies in the design and
results of the various studies. A single study of class-size and achievement may
yield several values of $\Delta_{S-L}$. In fact, if $k$ different class-sizes are compared
on a single achievement test, $k(k-1)/2$ values of $\Delta_{S-L}$ will result. This set of
$\Delta$'s from a single study will form a consistent set of values in that they can be
joined to form a single connected graph depicting the curve of achievement as a
function of class-size. However, various values of $\Delta_{S-L}$ arising from different
studies can show confusing inconsistencies. For example, suppose that Study #1
gave $\Delta_{10-15}$, $\Delta_{10-20}$ and $\Delta_{15-20}$, and Study #2 gave values of $\Delta$ for other
pairs of class-sizes. A few moments' reflection will reveal that there is no obvious or simple way to
connect these values into a single connected curve.
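
The count of paired comparisons within one study can be illustrated with a short sketch. The helper name is hypothetical; the class-sizes 10, 15 and 20 are the ones used in the Study #1 example.

```python
from itertools import combinations


def paired_comparisons(class_sizes):
    """All (smaller, larger) pairs among the class-sizes compared in one study."""
    return list(combinations(sorted(class_sizes), 2))


pairs = paired_comparisons([10, 15, 20])
k = 3
print(pairs)
print(len(pairs) == k * (k - 1) // 2)
```

Each pair produced this way corresponds to one standardized difference, so a study with k class-sizes contributes k(k-1)/2 of them.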

The eventual solution to these problems proceeded as follows: $\Delta_{S-L}$ was
regressed onto a quadratic function of $S$ and $L$ by means of the least-squares
criterion; then that set of values of $\Delta$ that could be expressed as a single,
connected curve was found.



The regression model selected accounted for variation in $\Delta_{S-L}$ by means of $S$,
$S^2$ and $L$. Obviously, something more than a simple linear function of $S$ and $L$
was needed, otherwise a unit increase in class-size would have a constant effect
regardless of the starting class-size $S$; and the $S^2$ term seemed as capable of
filling the need as any other. The size differential between the larger and
smaller class, $L-S$, was used in place of $L$ for convenience. Thus, the $\Delta_{S-L}$
values were used to fit the following model:

$$\Delta_{S-L} = \beta_0 + \beta_1 S + \beta_2 S^2 + \beta_3 (L-S) + \varepsilon. \qquad (4)$$

Fitting this model by least-squares will result in the curved regression surface

$$\hat{\Delta}_{S-L} = \hat{\beta}_0 + \hat{\beta}_1 S + \hat{\beta}_2 S^2 + \hat{\beta}_3 (L-S).$$

The problem now is to find the set of $\hat{\Delta}$'s in this surface that can be
depicted as a single curved-line relationship in a plane. The property that must
hold for a set of $\hat{\Delta}$'s before they can be depicted as a connected graph in a plane
is what might be called the consistency property:



$$\hat{\Delta}_{n_1-n_3} = \hat{\Delta}_{n_1-n_2} + \hat{\Delta}_{n_2-n_3}$$

for $n_1 < n_2 < n_3$. If this property is not satisfied, then one is in the strange
situation of claiming that the differential achievement between class-sizes 10
and 20 is not the sum of the differential achievement from 10 to 15 and then from
15 to 20.

When the consistency property is imposed on (4), it follows that:

$$\hat{\beta}_0 + \hat{\beta}_1 n_1 + \hat{\beta}_2 n_1^2 + \hat{\beta}_3 (n_3 - n_1) = \left[\hat{\beta}_0 + \hat{\beta}_1 n_1 + \hat{\beta}_2 n_1^2 + \hat{\beta}_3 (n_2 - n_1)\right] + \left[\hat{\beta}_0 + \hat{\beta}_1 n_2 + \hat{\beta}_2 n_2^2 + \hat{\beta}_3 (n_3 - n_2)\right]. \qquad (5)$$

Simple algebraic reduction of (5) produces the following:

$$\hat{\beta}_0 + \hat{\beta}_1 n_2 + \hat{\beta}_2 n_2^2 = 0. \qquad (6)$$

The two solutions to the quadratic equation in (6) are points $n_2$ such that
if $\Delta_{S-L}$ is measured with $n_2$ as either the larger, $L$, or smaller, $S$, class-size,
then the resulting set of $\hat{\Delta}$'s will lie on the four-dimensional regression surface
in (4) but can be depicted as a single-line curve in a plane. Since $n_2$ becomes
the point around which values of $n_1$ and $n_3$ are selected, it will be called the
pivot point.
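
A minimal sketch of the pivot-point idea follows, assuming the fitted surface has the form intercept + b1*S + b2*S^2 + b3*(L-S) described above. The coefficients are hypothetical round numbers chosen for illustration, not estimates from the report, and the function names are invented here.

```python
import math


def fitted_delta(s, l, b):
    """Fitted difference for smaller size s and larger size l."""
    b0, b1, b2, b3 = b
    return b0 + b1 * s + b2 * s * s + b3 * (l - s)


def pivot_points(b0, b1, b2):
    """Real roots n2 of b0 + b1*n2 + b2*n2**2 = 0."""
    disc = b1 * b1 - 4.0 * b2 * b0
    if b2 == 0 or disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b1 - root) / (2.0 * b2), (-b1 + root) / (2.0 * b2)])


def is_consistent(n1, n2, n3, b, tol=1e-9):
    """True when the n1-to-n3 difference equals the sum of the two steps."""
    direct = fitted_delta(n1, n3, b)
    chained = fitted_delta(n1, n2, b) + fitted_delta(n2, n3, b)
    return abs(direct - chained) < tol


# Hypothetical coefficients (b0, b1, b2, b3), for illustration only.
b = (-2.0, 0.3, -0.01, 0.02)
pivots = pivot_points(*b[:3])
print([round(p, 6) for p in pivots])
print(is_consistent(5, pivots[0], 40, b), is_consistent(5, 15, 40, b))
```

With these made-up coefficients the roots are 10 and 20; chaining through a root satisfies the consistency property, while chaining through a non-root such as 15 does not.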

A Logarithmic Model 

The above modified regression approach for integrating studies with quantitative
independent variables is disappointingly complex. Fortunately we have
found two simpler alternatives: 1) a logarithmic model and 2) a non-linear model.

The logarithmic model can be illustrated with the class-size problem.

Assume that the $\Delta$ for a comparison of class-size 1 and any other class-size
$C$ has the form

$$\Delta_{1-C} = \beta \log C + \varepsilon, \quad \text{where } \varepsilon \sim (0, \sigma^2).$$

Now consider the values of $C$ denoted by $S$ and $L$ which stand in the relationship
$S < L$. Then,

$$\Delta_{1-S} = \beta \log S + \varepsilon, \text{ and}$$
$$\Delta_{1-L} = \beta \log L + \varepsilon.$$

Assuming, quite reasonably, that

$$\Delta_{S-L} = \Delta_{1-L} - \Delta_{1-S},$$

then

$$\Delta_{S-L} = \beta \log (L/S) + \varepsilon. \qquad (7)$$

Thus, the parameter $\beta$ can be estimated by simple least-squares regression of
$\Delta_{S-L}$ onto $\log(L/S)$. Then a single curve depicting the relationship of $z$ to $C$ can
be drawn in a plane defined by the two axes $C$ and $z$, the integral (in the calculus sense)
of $\Delta$. We have applied this model in the analysis of class-size and achievement
with very satisfactory results. It fit the data with lesser mean-square error
than did the linear regression approach described above. Furthermore, this simple
logarithmic model presents far more tractable problems of statistical inference
than the modified regression model.

A Non-Linear Model 

A third alternative exists. Its comparative advantages will be pointed
out later.

Suppose that a study of the relationship of class-size and achievement is
done in which achievement is compared in classes of size $n_1$, $n_2$ and $n_3$. The average
achievement in each group is $\bar{Y}_1$, $\bar{Y}_2$ and $\bar{Y}_3$. A simple model for the relationship
between achievement and class-size in this study could take the following form:



1 



The parameter $\mu$ represents a hypothetical level of achievement at class-size
zero (i.e., $X = 0$). The parameter $\gamma$ is an arbitrary scale of measurement
parameter. If $\beta$ is restricted to the interval 0 to 1, then the curve described
is an exponential that does not drop off as fast as the logarithmic curve. For
example, the following table shows the decay in achievement as class-size increases
when $\beta = .90$.

Table 6.5

Comparison of Non-Linear and Logarithmic Models

   X      Non-linear model      Logarithmic model
          (β = .90)

   0      μ                     μ
   1      .90μ                  .50μ
   2      .81μ                  .33μ
   4      .66μ                  .25μ
   8      .43μ                  .20μ
  16      .19μ
  32      .03μ

In the third column above, the rate of decay for the logarithmic model is
given for comparison. As can be seen, the non-linear model drops off much less
rapidly for small values of $X$.
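
The non-linear column of the table can be checked with a short sketch. It assumes the decaying part of the model is the hypothetical achievement at size zero multiplied by the decay parameter raised to the power X, which is the reading that matches the legible entries (.90 and .81 of the starting level at X = 1 and 2). The function name is invented here.

```python
def remaining_fraction(x, beta=0.90):
    """Fraction of the size-zero achievement level left at class-size x,
    assuming geometric decay beta**x (an assumed form, see lead-in)."""
    return beta ** x


sizes = (0, 1, 2, 4, 8, 16, 32)
fractions = [round(remaining_fraction(x), 2) for x in sizes]
for x, f in zip(sizes, fractions):
    print(x, f)
```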

The non-linear model can easily be adapted for integrating many different
studies by allowing $\mu$ and $\gamma$ to vary depending on the study. By introducing a
coding variable $w_j$ which equals 1 when study $j$ is considered and zero otherwise,
the following integrative model is obtained:



Tr. 



xi 



(8)

This integrative non-linear model has $2J + 1$ unknown parameters and $J \cdot K$
data points, provided that each study has $K$ means; if at least one study has
three means, the model parameters can be estimated by means of non-linear
least-squares analysis.

The logarithmic model in (7) would fit data well where the drop off was
severe for small values of the quantitative independent variable. But the log model
has no asymptote, which is often a disadvantage. The non-linear model in (8) would
fit data well where the initial drop was less severe, but where an asymptote
was approached for large values of $X$. It ought to be possible to combine the
two models additively into a mixed model and gain the benefits of each.

The Logarithmic Model Illustrated

Consider an illustration from research on class-size and achievement.
Fourteen experiments were found in which pupils were randomly assigned to
classes of different sizes. These fourteen studies yielded over 100 separate
comparisons of achievement in smaller and larger classes. The multiplicity
of findings is due partly to the fact that in one study there may exist
several pairs of class sizes and partly to the fact that a single pair of
class sizes may have been measured on more than one achievement test. The
latter multiplicity was averaged out and the former retained in the summary
of 30 data points in Table 6.6.

One might expect class-size and achievement to be related in something
of an exponential or geometric fashion, reasoning that one pupil with one
teacher learns some amount, two pupils learn less, three pupils learn still
less, and so on. Furthermore, the drop in learning from one to two pupils
could be expected to be larger than the drop from two to three, which in turn
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Table 6.6 



Data on the Relationship of Class-size and Achievement from Studies Using
Random Assignment of Pupils.

(Outcomes scaled with $\hat{\sigma} = (s_S + s_L)/2$.)

Study     Size of          Size of         log_e(L/S)     Δ_{S-L}
Number    Smaller Class    Larger Class

   1.         1.              25.             3.22           .32
   2.         1.               3.             1.10           .22
   2.         1.              25.             3.22          1.52
   2.         3.              25.             2.12          1.22
   3.        17.              35.              .72          -.29
   4.        28.             112.             1.39          -.03
   5.         1.               2.              .69           .36
   5.         1.               5.             1.61           .52
   5.         1.              23.             3.14           .83
   5.         2.               5.              .92           .22
   5.         2.              23.             2.44           .57
   5.         5.              23.             1.53           .31
   6.        15.              30.              .69           .17
   7.        16.              23.              .36           .05
   7.        16.              30.              .63           .04
   7.        16.              37.              .84           .08
   7.        23.              30.              .27           .04
   7.        23.              37.              .48           .04
   7.        30.              37.              .21           0
   8.        20.              28.              .33           .15
   9.        26.              50.              .65           .29
  10.         1.              32.             3.46           .65
  11.        15.              37.              .90           .40
  11.        15.              60.             1.38          1.25
  11.        37.              60.              .48           .65
  12.         1.               8.             2.08           .30
  13.        15.              45.             1.10           .07
  14.         1.              14.             2.64           .72
  14.         1.              30.             3.40           .78
  14.        14.              30.              .76           .17

$$\Delta_{S-L} = \frac{\bar{X}_S - \bar{X}_L}{(s_S + s_L)/2}$$

n = 14 studies
N = 30 comparisons



1.42

is probably larger than the drop from three to four, and so on. A logarithmic
curve represents one such relationship:

$$y = \alpha - \beta \log_e C + \varepsilon, \qquad (9)$$

where $C$ denotes class-size.
In formula (9), $\alpha$ represents the achievement for a "class" of one person
since $\log_e 1 = 0$, and $\beta$ represents the speed of decrease in achievement as
class-size increases. The general curve is graphed in Figure 6.5.



Figure 6.5. Graph of the log curve for the model in formula (9).
(Horizontal axis: class-size, 5 to 30.)



Formula (9) cannot be fitted to data directly because $Y$ is not
measured on a common scale across studies. This problem can be circumvented
by calculating $\Delta_{S-L}$ for each comparison of a smaller and a larger class
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Figure 6.6 



Scatter Diagram of $\Delta_{S-L}$ Graphed Against $\log_e(L/S)$.
(Points numbered by study.)
Fitted model: $\Delta_{S-L} = \beta \log_e(L/S) + \varepsilon$; $r = .64$, $r^2 = .42$.

within a study. Then, from formulas (7) and (9) one has

$$\Delta_{S-L} = (\alpha - \beta \log_e S + \varepsilon_1) - (\alpha - \beta \log_e L + \varepsilon_2)$$
$$= \beta(\log_e L - \log_e S) + \varepsilon_1 - \varepsilon_2$$
$$= \beta \log_e(L/S) + \varepsilon. \qquad (10)$$

The model in formula (10) is particularly simple and straightforward.
The values of $\Delta_{S-L}$ are merely regressed onto the logarithm of the ratio of
the larger to the smaller class-size, forcing the least-squares regression
line through the origin.
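
A regression forced through the origin has a closed-form slope, the sum of x*y divided by the sum of x squared. The sketch below applies it to the six comparisons of Study 5 only, as read from Table 6.6; the function name is hypothetical and the result describes that one study, not the full data set.

```python
import math


def slope_through_origin(x, y):
    """Least-squares slope for y = beta * x + error, with no intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# (smaller size, larger size, standardized difference) for Study 5,
# as read from Table 6.6.
study5 = [(1, 2, 0.36), (1, 5, 0.52), (1, 23, 0.83),
          (2, 5, 0.22), (2, 23, 0.57), (5, 23, 0.31)]

x = [math.log(larger / smaller) for smaller, larger, _ in study5]
y = [delta for _, _, delta in study5]
beta_hat = slope_through_origin(x, y)
print(round(beta_hat, 2))
```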



A scatter diagram of the data in Table 6.6 appears as Figure 6.6, in which
$\Delta_{S-L}$ is graphed against $\log_e(L/S)$. The estimate of $\beta$ for these data equals
0.27. The value of $r$ is .64, and $r^2 = .42$. The resulting curve relating class-size
$C$ to achievement in standard-score units appears as Figure 6.7.

One can either weight each $\Delta_{S-L}$ in Table 6.6 equally in deriving an
estimate of $\beta$, or it can be reasoned that each of the fourteen studies should
receive equal weight so that each $\Delta_{S-L}$ is multiplied by $2/(k^2 - k)$ when it is
derived from a study involving $k$ different class-sizes. The estimate of $\beta$
from the regression involving weighted $\Delta$'s is equal to 0.21, which agrees
closely with the earlier result.
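
The two weighting schemes can be compared in a short sketch. The data are the 30 comparisons of Table 6.6 as reconstructed from a damaged scan: three leading digits (1.52, 1.22, 1.25) and the minus signs for Studies 3 and 4 are inferred rather than plainly legible, so the rows should be treated as an assumption. The function and variable names are hypothetical. Because every study reports all of its pairs, the weight 2/(k^2 - k) equals one divided by the number of comparisons from that study.

```python
import math
from collections import Counter

# (study, smaller size, larger size, standardized difference).
# Reconstructed values; see the caveat in the lead-in.
TABLE_6_6 = [
    (1, 1, 25, 0.32),
    (2, 1, 3, 0.22), (2, 1, 25, 1.52), (2, 3, 25, 1.22),
    (3, 17, 35, -0.29),
    (4, 28, 112, -0.03),
    (5, 1, 2, 0.36), (5, 1, 5, 0.52), (5, 1, 23, 0.83),
    (5, 2, 5, 0.22), (5, 2, 23, 0.57), (5, 5, 23, 0.31),
    (6, 15, 30, 0.17),
    (7, 16, 23, 0.05), (7, 16, 30, 0.04), (7, 16, 37, 0.08),
    (7, 23, 30, 0.04), (7, 23, 37, 0.04), (7, 30, 37, 0.00),
    (8, 20, 28, 0.15),
    (9, 26, 50, 0.29),
    (10, 1, 32, 0.65),
    (11, 15, 37, 0.40), (11, 15, 60, 1.25), (11, 37, 60, 0.65),
    (12, 1, 8, 0.30),
    (13, 15, 45, 0.07),
    (14, 1, 14, 0.72), (14, 1, 30, 0.78), (14, 14, 30, 0.17),
]


def weighted_slope_through_origin(rows, weights):
    """Weighted least-squares slope of delta on log(L/S), no intercept."""
    num = sum(w * math.log(l / s) * d for w, (_, s, l, d) in zip(weights, rows))
    den = sum(w * math.log(l / s) ** 2 for w, (_, s, l, d) in zip(weights, rows))
    return num / den


pairs_per_study = Counter(study for study, _, _, _ in TABLE_6_6)
equal_per_comparison = [1.0] * len(TABLE_6_6)
equal_per_study = [1.0 / pairs_per_study[study] for study, _, _, _ in TABLE_6_6]

beta_unweighted = weighted_slope_through_origin(TABLE_6_6, equal_per_comparison)
beta_weighted = weighted_slope_through_origin(TABLE_6_6, equal_per_study)
print(round(beta_unweighted, 2), round(beta_weighted, 2))
```

With the rows as listed, the estimates round to 0.27 (each comparison weighted equally) and 0.21 (each study weighted equally).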
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Figure 6.7.^ Data in Table 6.6 fitted to the log mod^l/ 

An Alternative Log Model

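A model may have advantages if it avoids highly interdependent data sets
created (as in the first model) by taking all pairwise differences in a study.
Such an alternative model can be developed along the following lines.

Let $\bar{y}_C$ and $s_C$ be the mean and standard deviation of the dependent variable
for class-size $C$ in one of $m$ studies. For the $k$ class-sizes in a particular
study, order the groups from $C_1 < \ldots < C_k$. Arbitrarily set
$\Delta_{C_k} = 0$; then,

$$\Delta_{C_{k-1}} = \frac{\bar{y}_{C_{k-1}} - \bar{y}_{C_k}}{(s_{C_{k-1}} + s_{C_k})/2},$$

$$\Delta_{C_{k-2}} = \Delta_{C_{k-1}} + \frac{\bar{y}_{C_{k-2}} - \bar{y}_{C_{k-1}}}{(s_{C_{k-2}} + s_{C_{k-1}})/2},$$

$$\Delta_{C_{k-3}} = \Delta_{C_{k-2}} + \frac{\bar{y}_{C_{k-3}} - \bar{y}_{C_{k-2}}}{(s_{C_{k-3}} + s_{C_{k-2}})/2},$$

and so on, down to $\Delta_{C_1}$. $\qquad (12)$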



The data from the fourteen class-size experiments have been scaled via
formula (12) and are recorded in Table 6.7.
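
The scaling step can be sketched as follows, under an explicit assumption: formula (12) is not fully legible in this copy, and the code takes it to accumulate adjacent standardized differences, each scaled by the average of the two adjacent standard deviations, with the largest class fixed at zero. That reading agrees with several entries of Table 6.7 (for example .09 = .05 + .04 + 0 for Study 7). The function name and the means and standard deviations are hypothetical.

```python
def cumulative_deltas(means, sds):
    """Scaled values for class-sizes ordered from smallest to largest.

    The last (largest) class is set to zero; each earlier class adds the
    adjacent mean difference divided by the average of the two SDs.
    """
    k = len(means)
    deltas = [0.0] * k
    for j in range(k - 2, -1, -1):
        average_sd = (sds[j] + sds[j + 1]) / 2.0
        deltas[j] = deltas[j + 1] + (means[j] - means[j + 1]) / average_sd
    return deltas


# Hypothetical study with three class-sizes (smallest first).
example = cumulative_deltas(means=[54.0, 52.0, 50.0], sds=[10.0, 10.0, 10.0])
print(example)
```

Only one value per class-size is produced, which is how this approach avoids the interdependence of all pairwise differences.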

The following model can be postulated for data of the form in (12):

The $\alpha_i D_i$ terms in (13) represent dummy variables and arbitrary level parameters
for the separate studies; $D_i = 1$ if a $\Delta$ in question comes from the
$i$th study, and it equals zero otherwise. The parameters $\beta$ and $\alpha_1, \ldots, \alpha_m$
can be estimated by regressing $\Delta$ onto $\log_e C$. We have done so for the data
in Table 6.7 and obtained a weighted least-squares estimate of $\beta$ equal to 0.22.
The estimates of the $\alpha$'s are unimportant. In this regression, each $\Delta$ was
weighted so that each of the 14 studies would receive equal weight.

The result is virtually identical to that obtained for the model in (10).

The model in (13) is more general and of more significance than the model
in (10). Model (13) can be applied in a wide range of circumstances in which
studies with quantitative independent variables are integrated. The first
log term in (13) can be replaced by any mathematical function appropriate to
a particular application. The important point about model (13) is that it
simultaneously resolves the problems presented by different scales of measurement
of $Y$ and different values of $X$ compared across studies.
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. Table 6.7 ' * 

Data on the Relationship of Class-size and Achievement from Studies Using
Random Assignment of Pupils.

Study      Size of      Δ
Number     Class

   1.         1.        .32
   1.        25.        0
   2.         1.       1.44
   2.         3.       1.22
   2.        25.        0
   3.        17.       -.29
   3.        35.        0
   4.        28.       -.03
   4.       112.        0
   5.         1.
   5.         2.
   5.         5.        .31
   5.        23.        0
   6.        15.        .17
   6.        30.        0
   7.        16.        .09
   7.        23.        .04
   7.        30.        0
   7.        37.        0
   8.        20.        .15
   8.        28.        0
   9.        26.        .29
   9.        50.        0
  10.         1.        .65
  10.        32.        0
  11.        15.       1.05
  11.        37.        .65
  11.        60.        0
  12.         1.        .30
  12.         8.        0
  13.        15.        .07
  13.        45.        0
  14.         1.        .95
  14.        14.        .17
  14.        30.        0




Non-Parametric Integration When the
Independent Variable is Quantitative

The methods of the previous section assume a model for the relationship
between the dependent and a quantitative independent variable. Standardized
contrasts of the form $\Delta$ are used to estimate the parameters of the model.

In many instances, too little will be known about the relationship to hypothesize
even an approximate model. Then, perhaps, an approach modeled after
Tukey's methods of exploratory data analysis might be more appropriate (Tukey,
1977). No functional relationship need be hypothesized, and the data themselves
will determine the shape of the curve. An example will help clarify the approach,
which may differ in details in particular applications.

Andrews, Guitar and Howie (1979) performed a meta-analysis of experimental
studies of stuttering therapies. Effect sizes were calculated for 42 studies;
all studies were pretest vs. posttest designs without control groups. Effects
were assessed by comparing the post-test mean against the pretest mean and
standardizing by the pretest standard deviation:

$$\Delta = \frac{\bar{X}_{post} - \bar{X}_{pre}}{s_{pre}}. \qquad (14)$$

The 42 studies yielded 116 $\Delta$'s. These $\Delta$'s were categorized by the type of
therapy applied, the duration of the therapy, type of outcome measure, and
several other features of the therapy and the clients. Differences in average
effect were obtained across types of therapy: Prolonged Speech therapy gave
a $\bar{\Delta} = 1.65$ for 47 effects; at the other end of the scale, Systematic
Desensitization gave a $\bar{\Delta} = 0.54$ for 5 effects (Andrews, Guitar & Howie,
1979; Table 3). No correlation was found between the number of months after
therapy in which effects were measured and the size of effect. This lack of
correlation seemed surprising and prompted the further search for a decay of
effect across time that is reported below. The "follow-up time" variable and
type of therapy are confounded in the Andrews stuttering data set. For
example, Airflow therapy showed an average $\Delta$ of 0.92, but these outcomes were
measured at 4.2 months after therapy on the average. On the other hand,
Attitude therapy showed a $\bar{\Delta} = .85$ for an average follow-up time of 3.3 months.
The only real difference between Attitude and Airflow average effects might
be attributable to varying follow-up times for measurement of benefits.
Likewise, the effect of different follow-up times may reflect therapy differences.
For this reason, the pattern of decay in effects across time should
be examined separately within each type of therapy. But another feature of
the studies is also confounded with follow-up time and should be likewise
controlled. Therapies differed with respect to the attention given to providing
for post-therapy maintenance of the gains made during therapy. Andrews and
his colleagues classified each study by whether there were many, some or no
provisions made for maintenance of gains achieved during therapy. Thus, it
seemed sensible to cross-classify effects by therapy type and maintenance
provisions before examining the data for the decay of treatment phenomenon.
Thus, 107 of the 116 effect sizes were cross-classified into the cells of an
8 x 3 (therapy type x maintenance provision) table, and the cell entries were
averaged.

The averaging of effects resulted in an 8 x 3 table (see Table 6.8). The
typical entry is a triplet of numbers of the form (a, b, c), where a is the
follow-up time in months, b is the average $\Delta$, and c is the number of $\Delta$ values
averaged. Within a cell of Table 6.8 the entries were graphed in a connected
line. Consider, for example, the cell for Rhythm therapy with many provisions
for maintenance. The four data points can be graphed, as shown by the solid
line in Figure 6.8. The broken line represents the three data points from Airflow
therapy at the second maintenance level. The elevation of either line on the
graph is immaterial; only the slope of the line relative to the abscissa is
significant. The number in parentheses beside each line is the average of
the number of effects that exist at each end of the line; for example,
the first segment of the solid line is based on 7 $\Delta$'s at zero months and 2 $\Delta$'s
at six months; hence the weight (7+2)/2 = 4.5 for the line segment.

Table 6.8

Follow-Up Time, Average Effect Size and Number of Effects Averaged
Classified by Type of Therapy and Provisions for Maintenance

                                 Maintenance Provisions
Therapy Type         1: None            2: Some            3: Many

Airflow                                 1, .88, 1
                                        3, .74, 1
                                        16, .86, 1

Rhythm               0, .66, 1                             0, 1.26, 7
                     14, .76, 2                            6, 1.57, 2
                                                           9, 1.60, 10
                                                           12, .86, 4

Shadow               0, .17, 1
                     14, .38, 1

Gentle Onset         0, 1.12, 2                            0, 2.37, 2
                     1, 1.38, 1                            10, 1.52, 2
                     10, 1.12, 1
                     25, 1.15, 1

Biofeedback                                                0, .88, 2
                                                           12, 1.03, 2

Attitude             0, .71, 7
                     9, 1.11, 4

Prolonged Speech     0, 2.02, 6                            0, 1.62, 9
                     3, 2.42, 2                            2, 2.02, 3
                     6, 1.27, 2                            12, 1.16, 8
                     9, 2.17, 3                            15, 1.16, 8
                     11, 1.77, 1                           18, 1.36, 3

Desensitization      0, .69, 1          1, .01, 1
                     1, .89, 1          3, .03, 1
                     20, 1.07, 1

One approach to aggregating the data on slopes is to take a weighted
average of all the lines above two successive months. For example, the slope
of the solid line in Figure 6.8 between months 1 and 2 is +.05 = (1.57 - 1.26)/(6 mos.);
the slope of the broken line is -.07. Since the weight for the solid line
segment is 4.5 and for the dashed line, 1.0, the weighted average slope between
months 1 and 2 is [4.5(.05) + 1.0(-.07)]/(4.5 + 1.0) = +.028.
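
The arithmetic of this example can be replayed in a short sketch. The segment end points and effect counts are the Rhythm/Many and Airflow/Some entries of the table of follow-up times; the helper name is hypothetical, and the slopes are rounded to two decimals as in the text.

```python
def weighted_mean(values, weights):
    """Weighted average of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Solid line: Rhythm/Many segment from 0 to 6 months (7 and 2 effects).
solid_slope = round((1.57 - 1.26) / (6 - 0), 2)
solid_weight = (7 + 2) / 2

# Broken line: Airflow/Some segment from 1 to 3 months (1 and 1 effects).
broken_slope = round((0.74 - 0.88) / (3 - 1), 2)
broken_weight = (1 + 1) / 2

average_slope = weighted_mean([solid_slope, broken_slope],
                              [solid_weight, broken_weight])
print(solid_slope, broken_slope, round(average_slope, 3))
```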

If the above procedure were repeated for each successive pair of months
and for all twelve lines that can be drawn from the data in Table 6.8, a complete
aggregate curve is obtained. Such a curve is depicted in Figure 6.9. The curve
shows a loss of benefits over the first twelve months after termination of
therapy; the average loss is roughly one-half standard deviation. Although
the general trend in the curve is unmistakably downward, not every intermediate
twist and curve is to be taken seriously as a stable, replicable feature of
the true relationship. Even though approximately twenty $\Delta$'s are still
determining the slope of the aggregate curve in Figure 6.9 at 12 months post
therapy, the estimates of the points on the curve are probably subject to a
fairly large sampling error. Inferential techniques, perhaps drawing on Tukey's
jackknife procedure (Mosteller and Tukey, 1958), would illuminate the question
of the reliability of the determination of the curve.



> ' ' - ' ^ ■ ' 

Figure 6.9. Aggregation by weighted averaging of data in
Table 6.8 on the decay of stuttering therapy effects.
(Horizontal axis: follow-up month, 1 to 12.)
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Aggregating Linear Slopes 

An alternative approach was applied to the analysis of deterioration
effects. This approach could be characterized as parametric to distinguish it
from the non-parametric method illustrated above. Within each cell of Table 6.8,
a straight trend line was fit to the $(t, \Delta)$ data by means of least-squares, i.e.,
the following model was fit by least-squares:

$$\Delta = \beta_0 + \beta_1 t + \varepsilon,$$ where
$\Delta$ is the average effect, and
$t$ is the number of months after treatment that the dependent
variable was measured.

These individual cell analyses number twelve. In each, estimates of $\beta_0$
and $\beta_1$ were obtained; in addition, the average number of $\Delta$'s for the data
points in the cell was obtained. For example, for the cell "Airflow/Some
Maintenance Provisions" in Table 6.8, the regression of $\Delta$ onto $t$ for the three
data points gives $\hat{\beta}_0 = .81025$ and $\hat{\beta}_1 = .00246$. In addition, since each $\Delta$ was
based on $n = 1$, the average $n$ is $\bar{n} = 1$. The regression equation spans the time
interval 1 to 16 months. The cell "Rhythm/None", with a weight of $\bar{n} = 1.50$, gives
$\hat{\beta}_0 = .66000$ and $\hat{\beta}_1 = .00714$. In Table 6.9 appear the within-cell regression
lines, the follow-up interval spanned and the $\bar{n}$-weights.
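
The within-cell fit for the Airflow/Some cell can be reproduced with the ordinary least-squares formulas. The sketch below uses the points (1, .88), (3, .74) and (16, .86), taking the third effect as .86, the value used in the quadratic example later in the chapter; the function name is hypothetical.

```python
def fit_line(t, y):
    """Ordinary least-squares intercept and slope for y = b0 + b1 * t."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


intercept, slope = fit_line([1, 3, 16], [0.88, 0.74, 0.86])
print(round(intercept, 5), round(slope, 5))
```

The rounded results match the intercept .81025 and slope .00246 quoted for this cell.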

The information in Table 6.9 can be integrated into a single curve by
taking the $\bar{n}$-weighted average of all slopes, $\hat{\beta}_1$. Only those slopes are
averaged at time point $t = t_i$ which were derived on data from a time interval
that spans $t_i$. For example, the aggregate slope at $t = 0$ is a weighted average
of all $\hat{\beta}_1$'s in Table 6.9 except those for "Airflow/Some" and "Desensitization/Some",
which were based on intervals that begin at $t = 1$ month post-therapy.
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Table 6.9 

Within Cell Regression Lines, Time Interval and $\bar{n}$-weights
for the Data in Table 6.8

Therapy/Maintenance        Regression of Δ onto t     Time Interval Spanned     n-weight
Provision Combination      β0           β1            (in months)

Airflow/Some                .81025       .00246       1, 3, 16                   1.00
Rhythm/None                 .66000       .00714       0, 14                      1.50
Rhythm/Many                1.45685      -.01990       0, 6, 9, 12                5.75
Shadow/None                 .17000       .01500       0, 14                      1.00
Gentle/None                1.22832      -.00398       0, 1, 10, 25               1.25
Gentle/Many                2.37000      -.08500       0, 10                      2.00
Biofeedback/Many            .88000       .01250       0, 12                      2.00
Attitude/None               .71000       .04444       0, 9                       5.50
Prolonged Speech/None      2.08383      -.02652       0, 3, 6, 9, 11             2.80
Prolonged Speech/Many      1.79433      -.03514       0, 2, 12, 15, 18           6.20
Desensitization/None        .78026       .01472       0, 1, 20                   1.00
Desensitization/Some        .00000       .01000       1, 3                       1.00




Hence, for $t = 0$,

$$\bar{\beta}_1 = \frac{1.50(.00714) + 5.75(-.01990) + \cdots + 1.00(.01472)}{1.50 + 5.75 + \cdots + 1.00} = -.0094.$$

So the inclination of the curve at $t = 0$ is .0094 units downward. At $t = 1$,
all twelve of the regression slopes in Table 6.9 are averaged because each
of the regression lines was determined across a time span that included $t = 1$.
The $\bar{n}$-weighted average is

$$\bar{\beta}_1 = \frac{1.00(.00246) + 1.50(.00714) + \cdots + 1.00(.01000)}{1.00 + 1.50 + \cdots + 1.00} = -.0084.$$

In this manner, the aggregated slope of the curve is determined for each
month from $t = 0$ to $t = 17$. The resulting aggregate curve is graphed along
with the previously derived non-parametric curve in Figure 6.10.
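
The aggregation rule can be written out as a short sketch. Slopes, spans and weights are the twelve rows of the table of within-cell regression lines; only the first and last follow-up month of each cell are kept, since the rule needs just the interval spanned. The names are hypothetical.

```python
# (cell, slope, first month, last month, n-weight)
CELLS = [
    ("Airflow/Some", 0.00246, 1, 16, 1.00),
    ("Rhythm/None", 0.00714, 0, 14, 1.50),
    ("Rhythm/Many", -0.01990, 0, 12, 5.75),
    ("Shadow/None", 0.01500, 0, 14, 1.00),
    ("Gentle/None", -0.00398, 0, 25, 1.25),
    ("Gentle/Many", -0.08500, 0, 10, 2.00),
    ("Biofeedback/Many", 0.01250, 0, 12, 2.00),
    ("Attitude/None", 0.04444, 0, 9, 5.50),
    ("Prolonged Speech/None", -0.02652, 0, 11, 2.80),
    ("Prolonged Speech/Many", -0.03514, 0, 18, 6.20),
    ("Desensitization/None", 0.01472, 0, 20, 1.00),
    ("Desensitization/Some", 0.01000, 1, 3, 1.00),
]


def aggregate_slope(t, cells=CELLS):
    """Weighted average of the slopes whose time interval spans month t."""
    spanning = [(slope, w) for _, slope, first, last, w in cells
                if first <= t <= last]
    return sum(slope * w for slope, w in spanning) / sum(w for _, w in spanning)


print(round(aggregate_slope(0), 4), round(aggregate_slope(1), 4))
```

The two printed values reproduce the aggregate slopes given in the text for months 0 and 1.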

In Figure 6.10, it is clear that the curve based on the weighted averaging
of fitted straight lines is smoother and more regular than the non-parametric
curve. This feature seems an advantage since the true curve of effects plotted
against follow-up times probably wouldn't follow the jagged, irregular path
of the non-parametric curve. But the aggregated curve based on linear slopes
appears to have attenuated the size of the effect decay across time. For
example, between 2 and 12 months, the non-parametric curve drops about .40
standard deviation units. Over the same interval, the curve from aggregated
linear slopes drops only about .15 standard deviation units. This difference
is so great as to cause one to search for a compromise solution.

Figure 6.10. Comparison of non-parametric and linear methods of curve aggregation.
(Curves: non-parametric aggregation; aggregation of linear slopes. Horizontal
axis: follow-up time in months. Vertical axis ticks: .60 to 1.00.)

Aggregating Quadratic Slopes

Fitting quadratic functions by least-squares estimation within each cell
of Table 6.8 may produce a more satisfactory aggregate curve. Consider, for
example, the cell "Airflow-Some Maintenance Provisions." The three pairs
of points are as follows:

Follow-up time, t:     1      3      16
Effect size, Δ:       .88    .74    .86

These points can be fit to the quadratic equation

$$\Delta = \beta_0 + \beta_1 t + \beta_2 t^2 + \varepsilon.$$

With three points and three parameters in the model, the fit of the equation
is perfect:

$$\hat{\Delta} = .9658 - .0911t + .0053t^2.$$

For example, at $t = 1$, the predicted effect is .88; at $t = 3$, $\hat{\Delta} = .74$;
at $t = 16$, $\hat{\Delta} = .86$.

This single quadratic curve spans the time interval from 1 to 16 months.
Its slope at any time $t$ on the interval is given by the value of the derivative
of the curve at the point $t$. In general, the slope of the curve at $t_i$ is given
by

$$\text{Slope}(t_i) = \hat{\beta}_1 + 2\hat{\beta}_2 t_i.$$

For example, the slope of the quadratic curve for "Airflow-Some
Maintenance" at 2 months post-treatment is

$$\text{Slope}(t = 2) = \left.\frac{d}{dt}\left(.9658 - .0911t + .0053t^2\right)\right|_{t=2} = \left.-.0911 + .0106t\,\right|_{t=2} = -.0699.$$

In words, then, the quadratic curve fit to the data has a slope of .07
standard deviation units downward at two months post-treatment.
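
With exactly three points, the least-squares quadratic is the interpolating parabola, so its coefficients can be obtained directly. The sketch below uses Newton divided differences (a choice made here for brevity, not necessarily the computation used in the report) on the three Airflow-Some points; the function names are hypothetical.

```python
def quadratic_through(p1, p2, p3):
    """Coefficients (b0, b1, b2) of the parabola through three (t, y) points."""
    (t1, y1), (t2, y2), (t3, y3) = p1, p2, p3
    d12 = (y2 - y1) / (t2 - t1)          # first divided differences
    d23 = (y3 - y2) / (t3 - t2)
    b2 = (d23 - d12) / (t3 - t1)         # second divided difference
    b1 = d12 - b2 * (t1 + t2)
    b0 = y1 - b1 * t1 - b2 * t1 * t1
    return b0, b1, b2


def slope_at(t, b1, b2):
    """Derivative of b0 + b1*t + b2*t**2 at time t."""
    return b1 + 2.0 * b2 * t


b0, b1, b2 = quadratic_through((1, 0.88), (3, 0.74), (16, 0.86))
print(round(b0, 4), round(b1, 4), round(b2, 4), round(slope_at(2, b1, b2), 2))
```

The unrounded coefficients give a slope of -.0700 at two months; the text's -.0699 comes from differentiating the rounded coefficients.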

This method of fitting quadratic curves can^be applied to each cell of 
Table 6.8, provided that more than two f ol Iow^ud times are present in a cell (at 
4east three data points are required to estimate the three parameters of the 
quadratic curve). Consequently, six of the 12 non-empty cells in Table 6.8 must 
be eliminated.. (An alternative a^propch not explored here would entail fitting 
straight lines in those cq^ls'with only two points and Ig^ter aggregating their 



slopes with the slopes from the quadratic curves. This mixing of quadratic -and 
'straignt line mo|pis is probably preferable to the elimination of two-data-point 
eel 1 s' fol lowed here. ) 

For each celljf^ith sufficient data, a quadratic curve can be fitted via 
least-squares. Then the curve is differentiated to obtain the function describing 
tne slope of the curve at any time t. These slopes can be calculated for eaciv 
value of Jt (to the nearest month, for example) across the^time interval spanned 
by the data^bn which'the curve was derived. Finally, for each value of t^ the 
slopes of the derived curves cJh 1)e averaged, or averaged after some appropriate 
wei'ghting, to form an aggregated curve. For the six quadratic curves fit to . ■ 

l^he data in Table 5.8, each slope was weighted by the average number of eff-ect 

> 

sizes in the cell (the same weight function applied in aggregating the data* 
by the non-parametric and- linear meithods above). 

The results of fitting the quadratic curves, the time span over which each
curve stretches, and the weight (average number of $\Delta$'s) for the six cells
appear as Table 6.10.

Table 6.10

Quadratic Curves, Follow-up Time-Spans and Weights
(Average Number of $\Delta$'s) for the Data in Table 6.8

                          Time Span                Coefficients of the quadratic
Cell                      (in months)   Weight     Constant    Linear    Quadratic

Airflow-Some                 1-16        1.00        .9658     -.0911      .0053
Rhythm-Many                  0-12        5.75       1.2413      .1741     -.0168
Gentle Onset-None            0-25        1.25       1.2471     -.0146      .0004
Prolonged Speech-None        0-11        2.80       2.1571     -.0318      .0050
Prolonged Speech-Many        0-18        6.20       1.8599     -.0857      .0029
Desensitization-None         0-20        1.00       0.5900      .2095     -.0095

Suppose one wished to calculate the aggregate slope of the follow-up curve at
t = 16 months post-treatment. From Table 6.10 it is seen that four cells
contribute data to determining follow-up effects at 16 months: airflow-some,
gentle onset-none, prolonged speech-many, and desensitization-none. The
first derivatives of the quadratic curves for these four cases and the weights
associated with each curve are as follows:

                          First Derivative      Weight

Airflow-some              -.0911 + .0106t        1.00
Gentle onset-none         -.0146 + .0008t        1.25
Prolonged speech-many     -.0857 + .0058t        6.20
Desensitization-none       .2095 - .0190t        1.00

The aggregate slope at t = 16 is found by solving each first derivative
at t = 16 and then forming the weighted average of the resulting four values:

$$\frac{1.00(.0785) + 1.25(-.0018) + 6.20(.0071) + 1.00(-.0945)}{1.00 + 1.25 + 6.20 + 1.00} \approx .003.$$

Thus, the slope of the follow-up curve at 16 months is a rise of
three-thousandths of a standard deviation per month, imperceptibly different from a
horizontal line. In similar manner, the slopes of the quadratic curves in
Table 6.10 were aggregated for each month from 0 to 17 and a composite curve
reflecting the proper slope at each month was drawn. This curve, referred to
as the "aggregation of quadratic slopes," appears along with the non-parametric
aggregated curve in Figure 6.11.
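The month-by-month aggregation can be sketched as follows. The coefficients, time spans and weights are transcribed from Table 6.10; the names `CELLS` and `aggregate_slope` are illustrative assumptions rather than anything taken from the original analysis.

```python
# cell: (first month, last month, weight, linear coeff., quadratic coeff.)
CELLS = {
    "airflow-some": (1, 16, 1.00, -0.0911, 0.0053),
    "rhythm-many": (0, 12, 5.75, 0.1741, -0.0168),
    "gentle onset-none": (0, 25, 1.25, -0.0146, 0.0004),
    "prolonged speech-none": (0, 11, 2.80, -0.0318, 0.0050),
    "prolonged speech-many": (0, 18, 6.20, -0.0857, 0.0029),
    "desensitization-none": (0, 20, 1.00, 0.2095, -0.0095),
}


def aggregate_slope(t, cells=CELLS):
    """Weighted average of the cell slopes b1 + 2*b2*t over cells spanning t."""
    num = den = 0.0
    for start, end, weight, b1, b2 in cells.values():
        if start <= t <= end:  # only cells whose data cover month t contribute
            num += weight * (b1 + 2.0 * b2 * t)
            den += weight
    if den == 0.0:
        raise ValueError("no cell spans this follow-up time")
    return num / den


print(round(aggregate_slope(16), 3))
```

Evaluating `aggregate_slope` at each month from 0 to 17 yields the slopes from which the composite curve is drawn.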

The aggregation of quadratic slopes clearly overcame the manifest
shortcoming of the method of aggregating linear slopes, viz., the attenuation of
effects. The quadratic curve is much more like the non-parametric curve than
was the aggregation of linear slopes.

INFERENTIAL METHODS OF META-ANALYSIS

The role of statistical inference in meta-analyses is somewhat
controversial. Inference at the level of persons within studies (i.e.,
methods that treat persons as the unit of analysis) seems quite unnecessary;
the rejection of hypotheses in such cases is nearly automatic and pro forma
since even small integrative analyses encompassing twenty or so studies are
likely to involve several hundred persons. The picture changes when one
considers "studies" and the variability produced by their characteristics
(e.g., location, date, investigator, types of subject, and the like). At
this second level, one can readily imagine that even fifty or 100 studies
may yield unstable findings, regardless of whether they subsume data from
a thousand or many thousand persons. An investigator who subtly communicates
his expectations of outcomes to his subjects affects all of them
equally, and there is little comfort in there being 100 subjects or 1,000.
So if any type of statistical inference ought to be undertaken in an
integrative analysis, it should be carried out with "study" rather than
"person" as the unit of analysis. But the prior question remains: should
meta-analyses use inferential statistics?

The answer is by no means obvious. Inferential statistics seem
to work well in two instances: randomized experiments and well-designed
surveys with explicit sampling procedures. The classical theory of
statistical inference assumes either the definition of a population and
rigorous sampling from it or, as Fisher later showed, the randomization of
units among conditions of an experiment. It works sensibly there; there
is little doubt in these applications about what is meant when it is asserted
that the confidence intervals cover the parameter with 95% probability or
that the probability of the hypothesis being rejected incorrectly is 1%.

The typical integrative or meta-analysis seldom meets either condition
of valid statistical inference. An attempt is made to locate every study
on the topic being examined. Those studies that are located constitute a
portion of a population of studies; but one hopes that the proportion is
close to 100%, and one is under no illusions about the group of studies in
hand being a random or probabilistic sample of the population. Rarely, a
meta-analysis will be undertaken on a literature so large that it is
impossible to read and analyze it all, even though one can describe, count
and otherwise delineate the population of studies. Then one might sensibly
draw random (or stratified, cluster, two-stage random) samples of studies
and apply classical inferential techniques with a legitimate warrant, as
Miller (197S) was forced to do in his meta-analysis of the effects of
psychoactive drugs.

The probability conclusions of inferential statistics depend on
something like probabilistic sampling, or else they make no sense. There
can be no question whether the relationship of a meta-analysis sample of
studies to the population is similar to the experimental randomization
upon which permutation test theory rests. It is not.

The arguments against inferential techniques in meta-analysis do
not satisfy the appetite for some indication of the instability or
unreliability of the results. When we showed our early work on psychotherapy
(Smith and Glass, 1977) to John Tukey, he chided us for not
presenting standard errors of the more important averages. Our recitation
of the reasons for not broaching the inferential questions left him
unconvinced; he felt that regardless of such complications, some
rudimentary inferential calculations would be informative and useful.
Since then, we have pursued inferential questions at the "study" level and
through the application of Tukey and Mosteller's "jackknife" technique (an
all-purpose approach to statistical inference for complex data sets where
classical theory is lacking).

Whether the findings from a collection of studies are regarded as a
sample from a hypothetical universe of studies, or they are in fact a
sample from a well-defined population, problems of statistical inference
arise. Significance tests or confidence intervals around estimates of
averages or regression planes will indicate where the research literature
is conclusive on a question and where the aggregated findings still leave
doubts, at least insofar as sampling error is concerned.

The inferential statistical problems of the meta-analysis of research
are uniquely complex. The data set to be analyzed will invariably contain
complicated patterns of statistical dependence. "Studies" cannot be
considered the unit of data analysis without aggregating findings above the
levels at which many interesting relationships can be studied. Each study
is likely to yield more than one finding. An experiment comparing
heterogeneous and homogeneous ability grouping might produce effect-size
measures on three types of school achievement at four points in time; thus,
12 of the several hundred effect-size measures in an aggregate data set
would have arisen from a single study. There is no simple answer to the
question of how many independent units of information exist in the larger
data set. One might attempt to impose some type of cluster or multiple-stage
sampling framework on the data, but in the end this will probably
restrict the movement of an imaginative data analyst. Two resolutions of
the problem can be envisioned: one risky, the other complex.

The simple (but risky) solution is to regard each finding as independent
of the others. The assumption is untrue, but practical. All inferential
calculations could proceed on this independence assumption. The results
(standard errors of means, of correlations, and of regression coefficients)
could be reported with the qualification that they were calculated under
the assumption of independence. This procedure might be useful because the
effect of the dependence is almost surely to increase standard errors of
estimates above what they would be if the same number of data points were
independent. Thus, if 50 effect-size measures from 30 studies yielded an
unsatisfactorily large standard error for the mean effect size, then it could
be assumed safely that the standard error would be even larger if the
complex dependence in the data were accounted for properly.

The matter of statistical efficiency and "lumpy" data can be described
more formally by appealing to an analogy with cluster sampling in survey
research. Imagine that "studies" are like clusters and effect-size measures
(or r's or any other appropriate description of findings) are like
observations or cases within clusters. It is well-known from sampling theory
(Cochran, 1953) that if m clusters each containing n elements are drawn
randomly from a population in which the intra-cluster correlation of elements
is denoted by $\rho$, then the variance error of the mean of the mn observations
is given approximately by:

$$\operatorname{Var}^*(\bar{y}_{\cdot}) = \frac{\sigma^2}{mn}\,[1 + (m - 1)\rho], \qquad (15)$$

where $\sigma^2$ is the homogeneous within-cluster variance of the observations.

The analogy with applications to meta-analysis can be drawn by
associating studies with clusters; then $\rho$ becomes the intra-study
correlation of effect-sizes, say. It is instructive to notice in the above
equation that intra-cluster (or "intra-study") correlation changes the
variance of the mean, from what would be obtained under independence, by a
factor of $1 + (m - 1)\rho$. It is improbable that $\rho$ would ever be negative,
hence the conclusion that intra-study correlation of findings in
meta-analyses increases variance errors, thus decreasing the reliability of
aggregates from what would be expected under independence.

Fortunately, the results from several extant meta-analyses can be
used to investigate what a typical value of $\rho$ might be. Then, the typical
inflation of the variance error of the mean can be estimated. In Table 6.11
appear the intra-study correlation coefficients (of course, these are merely
intra-class correlations) calculated from the data of seven meta-analyses.

Only one of the seven $\rho$'s in Table 6.11 is below .50; they average
.61, but they vary greatly about that average. Nonetheless, .60 gives a
reasonably typical value of $\rho$ with which to inquire further.

Table 6.11

Intra-Study Correlation Coefficients
from Seven Meta-Analyses

                                                                                       No. of
                                                                                       Findings     Intra-Study
                                                                          No. of       Within       Correlation
Investigator(s)                      Topic                                Studies      Studies      Coefficient

Kavale ('79)                         Psycholinguistic training               27          220           .24
Schlesinger, Mumford & Glass ('78)   Treatment of asthma                     11           19           .85
Smith ('80)                          Sex-bias in psychotherapy               34           50           .69
Glass et al. ('77)                   Teacher indirectness & achievement      19           34           .90
Glass ('77)                          Effects of psychotherapy on anxiety     26           39           .51
Smith, Glass & Miller ('80)          Psychotherapy                           60          185           .60
Shavelson et al. ('77)               Stability of teacher effects            19           52           .50

Under the assumption of independence of findings within studies,
the variance error of an aggregate average of n findings within each of
m studies is given by:

$$\operatorname{Var}(\bar{y}_{\cdot}) = \frac{\sigma^2}{mn}.$$

An intra-study correlation of findings increases the variance of
the mean to:

$$\operatorname{Var}^*(\bar{y}_{\cdot}) = \frac{\sigma^2}{mn}\,[1 + (m - 1)(.6)].$$

The ratio of the latter to the former equals:

$$1 + (m - 1)(.6),$$

which indicates the inflation of the variance error due to the
non-independence of findings within studies. It is important to note that
the inflation factor does not depend on the number of findings, n, within
studies, but rather it depends on the number of studies, m.

Another way to view the inflation of the variance error of the mean
due to non-independence is to express $\operatorname{Var}^*(\bar{y}_{\cdot})$ as follows by dropping terms
of order 1/m:

$$\operatorname{Var}^*(\bar{y}_{\cdot}) \approx \operatorname{Var}(\bar{y}_{\cdot}) + \frac{\sigma^2\rho}{n}.$$

This formulation shows that the variance of the mean is increased
by $\sigma^2\rho/n$ due to the non-independence of findings within studies.

The following table illustrates the inflation of $\operatorname{Var}^*(\bar{y}_{\cdot})$ over
$\operatorname{Var}(\bar{y}_{\cdot})$ because of non-independence. It is based on the typical
intra-study correlation of .60 from Table 6.11 and an assumption of n = 2 findings
per study.

No. of          (a)                          (b)
Studies         Var(ȳ.)                      Var*(ȳ.)                     b/a

    5           (.10)$\sigma^2$              (.34)$\sigma^2$               3.4
   10           (.05)$\sigma^2$              (.32)$\sigma^2$               6.4
   20           (.025)$\sigma^2$             (.31)$\sigma^2$              12.4
   50           (.01)$\sigma^2$              (.304)$\sigma^2$             30.4
  100           (.005)$\sigma^2$             (.302)$\sigma^2$             60.4
  500           (.001)$\sigma^2$             (.3004)$\sigma^2$           300.4
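The entries of the table above can be recomputed with a short script. This sketch applies the inflation factor 1 + (m - 1)ρ exactly as it is written in the text, with ρ = .6 and n = 2; the function name `variance_inflation` is hypothetical. Standard cluster-sampling references usually attach the corresponding factor to the number of elements per cluster, so readers adapting the formula may wish to check the notation against Cochran.

```python
def variance_inflation(m, n, rho):
    """Return (Var, Var*, ratio) in units of sigma**2 for m studies of n findings."""
    var_indep = 1.0 / (m * n)        # variance of the mean under independence
    ratio = 1.0 + (m - 1) * rho      # inflation factor as stated in the text
    return var_indep, var_indep * ratio, ratio


for m in (5, 10, 20, 50, 100, 500):
    a, b, ratio = variance_inflation(m, n=2, rho=0.6)
    print(f"{m:4d}  {a:.4f}  {b:.4f}  {ratio:6.1f}")
```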



The calculations are remarkable. They show, for example, that given
an intra-study clustering of .6 for 50 studies with two findings each, the
variance error of the mean of all 100 findings is thirty times larger than
the variance error one would suppose to be true assuming independence. Thus,
statistical intuitions developed from experience with independent data sets
must be held in check when dealing with the kinds of non-independent data
typical of meta-analyses. Furthermore, it is important that statistical
techniques applied to meta-analysis take account of the non-independent
structure of the data, either by use of formulas for clustering such as
illustrated here or by use of the jackknife technique.

Tukey's Jackknife 

An inferential technique which takes account of the interdependencies
in a large set of findings in a meta-analysis is Tukey's jackknife method
(Mosteller & Tukey, 1968). Space does not permit a basic exposition of the
jackknife technique. One suggestion and an example must suffice. In calculating
the "pseudovalues" in the jackknife method, some portion of the data set is
discarded, and the sample estimate of the parameter of interest is calculated.
In a meta-analysis, the portion of data eliminated should correspond to all
those findings (e.g., effect sizes or correlation coefficients) arising from
a particular study. Thus there will be as many pseudovalues as there are
studies. The method will be illustrated on a small portion of the data from a
meta-analysis of psychotherapy outcome studies.

The data in Table 6.12 represent 39 effect-size measures from 26
experimental studies in which behavioral and nonbehavioral psychotherapies were
compared for their effects on fear and anxiety. The effect-size measure was
defined as $\Delta = (\bar{X}_{\text{behavioral}} - \bar{X}_{\text{nonbehavioral}})/s_x$. For example, study 1 produced two
measures of experimental effect, the first of which shows the nonbehavioral
therapy as slightly superior to the behavioral therapy, and the second of which
shows the behavioral therapy nearly three-fourths of a standard deviation
superior to the nonbehavioral therapy. The first step in establishing a
jackknife confidence interval on the mean effect size is to average the 39
effect-size measures to obtain $\bar{\Delta}$. Second, 26 partial means, $\bar{\Delta}_{-i}$, are
calculated by eliminating each study in turn; for example, the first partial
mean is based on the 37 effect-size measures remaining after the two measures
from study 1 (-.10, .74) are removed. Third, 26 pseudovalues are calculated as
follows:

$$\theta_i = 26\bar{\Delta} - 25\bar{\Delta}_{-i}.$$

The pseudovalues can safely be regarded as a sample of observations of
normally distributed independent variables, with expected value approximately
equal to the true mean effect size and variance $\sigma^2$. Thus, the set of
pseudovalues $\theta_i$ can be treated as an ordinary sample of data to which
t-distribution methods can be applied. The right-hand side of Table 6.12 lists
the calculations for the 95 percent confidence interval on the true effect
size; the interval does not quite span zero, indicating a statistically
reliable superiority of the behavioral therapies. By comparison,

a traditional 95 percent confidence interval on the population mean effect size
calculated from the 39 effect-size measures, assuming independent observations,
extends from a lower limit below zero to +.50.
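The study-level jackknife described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `jackknife_by_study` is hypothetical, the critical t value is supplied by the caller (e.g., 2.06 for 25 degrees of freedom, as used in Table 6.12), and the three studies in the usage lines are only the first three rows of Table 6.12, not the full data set.

```python
import math
import statistics


def jackknife_by_study(studies, t_crit):
    """Jackknife interval for the mean effect size, deleting one study at a time.

    studies: list of lists, one inner list of effect sizes per study.
    Returns (pseudovalues, (lower, upper)).
    """
    k = len(studies)
    grand_mean = statistics.fmean(d for study in studies for d in study)
    pseudovalues = []
    for i in range(k):
        # Partial mean: all effect sizes except those from study i.
        kept = [d for j, study in enumerate(studies) if j != i for d in study]
        pseudovalues.append(k * grand_mean - (k - 1) * statistics.fmean(kept))
    center = statistics.fmean(pseudovalues)
    half_width = t_crit * statistics.stdev(pseudovalues) / math.sqrt(k)
    return pseudovalues, (center - half_width, center + half_width)


# Usage with three studies only (t = 4.303 is the .975 point for 2 df).
pseudo, interval = jackknife_by_study([[-0.10, 0.74], [0.43, 0.45], [0.65]], 4.303)
print([round(p, 3) for p in pseudo], interval)
```

Deleting whole studies, rather than single effect sizes, is what lets the pseudovalues absorb the dependence among findings from the same study.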

Statistical inferential methods on the type of data illustrated here
could play a role in directing future research. From standard errors of
averages and confidence regions around regression planes, one can determine
where parameters are sharply estimated by the current body of research studies
and where sample estimates remain poor. The simple cross-tabulation of the
characteristics of studies completed is helpful for the same purpose. However,
it must be pointed out that the number of studies needed to estimate accurately
an aggregate effect size is partly a function of the variance of effect sizes.
For example, 5 studies may determine accurately the effect of amphetamines
on hyperactive 8-year-olds, whereas 20 studies may be needed to achieve the
same accuracy with 12-year-olds if the effects are fundamentally more variable

on older children.

Table 6.12

Illustration of Application of the Jackknife Technique of Interval Estimation of
Mean Effect Size

              Effect-Size                 Pseudovalues
Study No.     Measures                    $26\bar{\Delta} - 25\bar{\Delta}_{-i}$

    1         -.10, .74                       .366
    2          .43, .45                       .528
    3          .65                            .493
    4          .52                            .407
    5          .20                            .197
    6         -.16                           -.040
    7         -.50                           -.264
    8         3.35                            .291
    9          .18                            .184
   10                                         .278
   11         -.39                           -.191
   12          .33                            .282
   13         -.95                           -.560
   14          .12                            .144
   15          .08                            .118
   16         1.90                           1.315
   17         -.44                           -.224
   18        -1.00                           -.593
   19          .06, .20, .10, .00            -.097
   20          .64                            .486
   21          .96
   22          .05, .20                       .102
   23          .01                            .072
   24          .12, .06, .14, -.28           -.368
   25         -.22                           -.079
   26         1.28, .24                      1.016

Calculations:

   N = 39 effect-size measures
   Mean = .186
   s = .457
   95% jackknife confidence interval on $\Delta$:
   t(.975; 25) = 2.06
   .186 ± (2.06)(.457)/$\sqrt{26}$ = (.002, .371)
The illustration above showed that a confidence interval based on
jackknifing on "study" as the unit of analysis was narrower than the confidence
interval calculated by traditional methods with individual $\Delta$'s as the units
of analysis. This was unexpected and contrary to the illustration to be
presented here. It probably is due to the fact that the largest positive
and largest negative values of $\Delta$ arose from the same study. A recent
application of the jackknife to meta-analysis by Haertel, Walberg and Haertel
(1979) gave results more in accord with expectations. When multiple linear
regression weights were jackknifed using "study" as the unit, the t-statistics
for the significance of the differences of the beta-weights from zero were
nearly always smaller for the jackknife estimates than for the conventional
estimates (Table 4 of Haertel, Walberg and Haertel, 1979).

An illustration will indicate the lines along which the jackknife
approach to statistical inference in meta-analysis can be applied. The
class-size and achievement analysis above can serve as the illustration.
A total of 108 comparisons of achievement in smaller and larger classes was
available to fit the logarithmic curve. These 108 comparisons actually arose from
14 different studies. The multiplicity of data arose both from multiple
comparisons within a study (a study comparing four class sizes produced six
comparisons) and multiple achievement measures for individual comparisons. (The
complete data set appears in Glass and Smith, 1978.) A traditional inferential
analysis that takes no regard of the complex interdependencies of the data set
(108 $\Delta$'s corresponding to only 30 unique comparisons of class-size arising
from only 14 studies) would proceed along the following lines.



The least-squares regression of $\Delta_{S-L}$ onto $\log_e(L/S)$ has the solution:

$$\hat{\beta} = \frac{\sum \Delta_{S-L}\,\log_e(L/S)}{\sum [\log_e(L/S)]^2}.$$

For the 108 data points,

$$\hat{\beta} = \frac{108.780}{385.745} = .2820.$$

The estimate of residual variance equals:

$$\hat{\sigma}_e^2 = .1823.$$

From traditional least-squares theory, it can be shown that:

$$\hat{\sigma}_{\hat{\beta}}^2 = \hat{\sigma}_e^2\,\Big[\sum [\log_e(L/S)]^2\Big]^{-1}.$$

Thus, in the example,

$$\hat{\sigma}_{\hat{\beta}} = \sqrt{.1823\,(385.745)^{-1}} = .02174.$$

Assuming normal distributions of estimates of $\beta$, the 95% confidence interval
on $\beta$ is given by:

$$\hat{\beta} \pm 1.98\,\hat{\sigma}_{\hat{\beta}} = .2820 \pm .0430 = (.2390, .3250).$$
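The arithmetic of the conventional interval can be checked directly from the summary quantities quoted in the text. The variable names below are illustrative; the four input numbers are the ones reported above.

```python
import math

sum_xy = 108.780    # sum of Delta * log(L/S) over the 108 comparisons
sum_xx = 385.745    # sum of [log(L/S)]**2 over the 108 comparisons
resid_var = 0.1823  # estimate of residual variance reported in the text
crit = 1.98         # critical value used in the text for the 95% interval

beta_hat = sum_xy / sum_xx                 # regression through the origin
se_beta = math.sqrt(resid_var / sum_xx)    # standard error of beta_hat
interval = (beta_hat - crit * se_beta, beta_hat + crit * se_beta)

print(round(beta_hat, 4), round(se_beta, 5), [round(v, 4) for v in interval])
```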

The results of the interval estimation prove to be quite different
when the jackknife method is used to take account of variation at the study
level. The first step in calculating the jackknife interval on $\beta$ involves
the calculation of all 14 pseudo-values, one for each study, by the
formula:

$$\theta_i = 14\hat{\beta} - 13\hat{\beta}_{-i},$$

where $\hat{\beta}_{-i}$ is the estimate of $\beta$ calculated by excluding all pairs of $\Delta$ and
log L/S that arise from the ith study.

Using the earlier calculations on the entire data set, it can be
computed that

$$\theta_i = 3.948 - 13\,\frac{108.780 - \sum \Delta \log_e(L/S)}{385.745 - \sum [\log_e(L/S)]^2},$$

where the summation is over all pairs of values of $\Delta$ and $\log_e$ L/S that
appear in the ith study.

The fourteen values of $\hat{\beta}_{-i}$ for the data appear below coded by the
study number used in Glass and Smith (1978):
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Study No. 

' 001 
003 
006 • 
008 
009 
015 

-0^9 \ 

■Q52 ' 

055 

058 

061 

073 

077 



$\hat{\beta}_{-i}$          $\theta_i$

.286611        .222057
.216408       1.134696
.284079        .254973
.285092        .241804
.283260        .265620
.282092        .280810
.286599        .222213
.281715        .285692
.281494        .288578
.312188       -.110444
.277897        .335339
.281980        .282260
.282685        .273095
.293232        .135984

$\bar{\theta}_{\cdot} = .293760$
$s_\theta = .265047$



The 95% confidence interval on $\beta$ is now calculated by the formula

$$\bar{\theta}_{\cdot} \pm t_{.975;\,df}\; s_\theta/\sqrt{n},$$

where n is the number of studies and df is n - 1, in this case, but not
generally.

For the data of this illustration, the above formula takes the value:

$$.293760 \pm 2.14\,(.265047)/\sqrt{14} = .293760 \pm .151590 = (.1422, .4454).$$

This jackknife interval on $\beta$ is more than 350% as wide as the interval
calculated earlier by conventional methods that treated each pair of values
$\Delta$ and log L/S as an independent data point. The jackknife method appears
to be appropriate and equal to the task of handling data sets interlaced
with complicated dependencies.
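The jackknife interval can be reproduced from the fourteen pseudo-values. In this sketch the pseudo-values are transcribed from the listing above (a few digits are restored so that each value satisfies the pseudo-value formula), the critical value 2.14 is the one used in the text, and the variable names are illustrative.

```python
import math
import statistics

pseudovalues = [
    0.222057, 1.134696, 0.254973, 0.241804, 0.265620, 0.280810, 0.222213,
    0.285692, 0.288578, -0.110444, 0.335339, 0.282260, 0.273095, 0.135984,
]
t_crit = 2.14  # critical value as used in the text

n = len(pseudovalues)
center = statistics.fmean(pseudovalues)
spread = statistics.stdev(pseudovalues)      # sample SD (divisor n - 1)
half_width = t_crit * spread / math.sqrt(n)
jk_interval = (center - half_width, center + half_width)

# Width relative to the conventional interval (.2390, .3250) reported earlier.
width_ratio = (jk_interval[1] - jk_interval[0]) / (0.3250 - 0.2390)
print(round(center, 4), round(spread, 4), jk_interval, round(width_ratio, 2))
```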



Generalized Least-Squares

The methods illustrated on the class-size data above are ordinary
least-squares analysis (OLS) and jackknife (JK) analysis. There exists a
third means of analysis that is theoretically more rigorous and may prove
superior to the putatively inappropriate OLS and the unknown JK analysis.
The third method is the method of generalized least-squares analysis (GLS).

OLS is the traditional method of linear estimation based on a model of
independently and normally distributed errors. It is, in fact, a special
case of the method of GLS, which permits the errors in the linear model to
be correlated. Correlated errors prevail in the type of data that are fitted
to the logarithmic model in meta-analyses.
Suppose, to begin with a simple example, that a study of the relationship
between class-size and achievement is performed where achievement is
compared among class-sizes of $n_1$, $n_2$ and $n_3$ pupils (assume the n's increase
in size from $n_1$ to $n_3$). From the logarithmic model,

$$z_{ij} = \beta \log n_j + e_{ij}$$

for the ith pupil in the jth class. It is assumed that the $e_{ij}$ are
independently and normally distributed with variance $\sigma^2$. In order to remove
arbitrary scale factors and fit the model, the class means must be paired,
differenced and standardized to form delta measures; e.g.,

$$\Delta_{n_1-n_2} = \bar{z}_{\cdot 1} - \bar{z}_{\cdot 2}.$$

Now, the random variable $\Delta_{n_1-n_2}$ has a normal distribution with

$$\text{mean} = \beta \log(n_1/n_2), \text{ and}$$

$$\text{variance} = \operatorname{Var}(\bar{e}_{\cdot 1} - \bar{e}_{\cdot 2}) = \sigma^2\Big(\frac{1}{n_1} + \frac{1}{n_2}\Big).$$

There are three possible pairs of the class-sizes $n_1$, $n_2$ and $n_3$; hence
there are three possible $\Delta$'s. However, the deltas are constrained by the
restriction that

$$\Delta_{n_1-n_3} = \Delta_{n_1-n_2} + \Delta_{n_2-n_3}.$$

Thus, one of the three adds no information to the remaining two; only two
deltas need be considered. (In the more general case of J class-sizes, there
are J(J-1)/2 possible deltas, but only J-1 of these are free to vary.) Thus,
the available information is completely contained in any nonredundant subset
of J-1 deltas. It will be convenient to work with only those deltas that are
formed by comparing each class-size in turn with the smallest class-size, e.g.,

$$\Delta_{n_1-n_2},\ \Delta_{n_1-n_3},\ \ldots,\ \Delta_{n_1-n_J}.$$

In the three class-size comparison, the deltas will be
$\Delta_{n_1-n_2}$ and $\Delta_{n_1-n_3}$.

We have already seen that $\Delta_{n_1-n_2}$ has error variance equal to $\sigma^2(1/n_1 + 1/n_2)$.
Likewise, $\Delta_{n_1-n_3}$ has error variance equal to $\sigma^2(1/n_1 + 1/n_3)$. It remains to
determine the covariance of these two deltas:

$$\operatorname{Cov}(\Delta_{n_1-n_2}, \Delta_{n_1-n_3}) = \operatorname{Cov}(\bar{e}_{\cdot 1} - \bar{e}_{\cdot 2},\ \bar{e}_{\cdot 1} - \bar{e}_{\cdot 3})$$

$$= \operatorname{Cov}(\bar{e}_{\cdot 1}, \bar{e}_{\cdot 1}) - 0 - 0 + 0 = \operatorname{Var}(\bar{e}_{\cdot 1}) = \sigma^2/n_1.$$

It should be clear that in a set of J-1 deltas formed by comparing
each $n_j$ in turn with $n_1$, each delta has variance given by

$$\operatorname{Var}(\Delta_{n_1-n_j}) = \sigma^2\Big(\frac{1}{n_1} + \frac{1}{n_j}\Big)$$

and each pair of deltas has covariance given by

$$\operatorname{Cov}(\Delta_{n_1-n_j}, \Delta_{n_1-n_{j'}}) = \frac{\sigma^2}{n_1}.$$

Hence, the set of two deltas in our example has the following
variance-covariance matrix of errors:

$$\sigma^2\Sigma = \sigma^2 \begin{bmatrix} \dfrac{1}{n_1} + \dfrac{1}{n_2} & \dfrac{1}{n_1} \\[2ex] \dfrac{1}{n_1} & \dfrac{1}{n_1} + \dfrac{1}{n_3} \end{bmatrix}. \qquad (18)$$

A general linear model could now be stated for the two deltas:

$$\Delta_{n_1-n_j} = \beta \log(n_1/n_j) + \varepsilon_j,$$

where the vector of $\varepsilon$'s is distributed normally with zero mean vector and
variance-covariance matrix in formula (18) above.

Denoting the variance-covariance matrix of errors by $\sigma^2\Sigma$, then
Johnston (1972) shows that the generalized least-squares solution for $\beta$ is
contained in the following quantities:

$$\hat{\beta} = (X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\Delta, \qquad (19)$$

where $\Delta$ is the vector of deltas (two, in the example), and X is the matrix
of independent variable values (in the example, a 2 x 1 vector with entries
$\log(n_1/n_2)$ and $\log(n_1/n_3)$);

$$\operatorname{Var}(\hat{\beta}) = \sigma^2(X^T\Sigma^{-1}X)^{-1}, \qquad (20)$$

and an unbiased estimate of $\sigma^2$ is given by

$$\hat{\sigma}^2 = (\Delta - X\hat{\beta})^T\Sigma^{-1}(\Delta - X\hat{\beta})/(N - k), \qquad (21)$$

where N is the number of deltas and k is the number of parameters estimated
(one, in the example).
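Formulas (19) to (21) can be traced through numerically for the two-delta example. The sketch below is a minimal illustration, not the authors' program: the function name `gls_two_deltas` is hypothetical, the 2 x 2 matrix of formula (18) is inverted in closed form, and the class sizes and deltas in the usage lines are made-up values chosen only to exercise the arithmetic.

```python
import math


def gls_two_deltas(n1, n2, n3, d12, d13):
    """GLS estimate of beta from the deltas (n1 vs n2) and (n1 vs n3).

    Returns (beta_hat, sigma2_hat, var_beta_hat) following formulas (19)-(21).
    """
    x = [math.log(n1 / n2), math.log(n1 / n3)]
    deltas = [d12, d13]
    # Sigma from formula (18), without the sigma**2 factor.
    s11, s22, s12 = 1 / n1 + 1 / n2, 1 / n1 + 1 / n3, 1 / n1
    det = s11 * s22 - s12 * s12
    inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]
    # x' Sigma^-1, then the scalar x' Sigma^-1 x.
    xt_inv = [sum(x[i] * inv[i][j] for i in range(2)) for j in range(2)]
    xsx = sum(xt_inv[j] * x[j] for j in range(2))
    beta_hat = sum(xt_inv[j] * deltas[j] for j in range(2)) / xsx      # (19)
    resid = [deltas[j] - beta_hat * x[j] for j in range(2)]
    quad = sum(resid[i] * inv[i][j] * resid[j]
               for i in range(2) for j in range(2))
    sigma2_hat = quad / (2 - 1)                                         # (21), N = 2, k = 1
    var_beta_hat = sigma2_hat / xsx                                     # (20)
    return beta_hat, sigma2_hat, var_beta_hat


# Usage with hypothetical deltas for class sizes 1, 3 and 25.
print(gls_two_deltas(1, 3, 25, -0.60, -1.50))
```

With only two deltas there is a single degree of freedom for error, so the sketch is useful for checking the algebra rather than for inference.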

In a typical meta-analysis, deltas will arise from more than one
study. Thus, there may be two deltas from Study #1 (J=3) and three deltas
from Study #2 (J=4). This arrangement of data does not substantially
complicate the GLS analysis outlined above. The vector of deltas is now of order
5 x 1 and the variance-covariance matrix of errors, $\Sigma$, is a block-diagonal
matrix of order 5 x 5:

$$\Sigma = \begin{bmatrix}
\frac{1}{n_1}+\frac{1}{n_2} & \frac{1}{n_1} & 0 & 0 & 0 \\
\frac{1}{n_1} & \frac{1}{n_1}+\frac{1}{n_3} & 0 & 0 & 0 \\
0 & 0 & \frac{1}{n_1}+\frac{1}{n_2} & \frac{1}{n_1} & \frac{1}{n_1} \\
0 & 0 & \frac{1}{n_1} & \frac{1}{n_1}+\frac{1}{n_3} & \frac{1}{n_1} \\
0 & 0 & \frac{1}{n_1} & \frac{1}{n_1} & \frac{1}{n_1}+\frac{1}{n_4}
\end{bmatrix}$$

where the $n_1$ in Study #1 may be different from the $n_1$ in Study #2 (and
likewise for $n_2$, ...).

The block diagonal matrix in formulas (19), (20) and (21) yields the
proper estimate of $\beta$, its standard error, and an estimate of error variance.
The distribution of $\hat{\beta}$ divided by its estimated standard error is known to
be Student's t-distribution with degrees of freedom equal to N-1, where N
is the number of deltas (there being J-1 deltas for each study) (Johnston,
1972, p. 210).
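The block-diagonal error matrix for several studies can be assembled mechanically from each study's class sizes. This sketch uses plain nested lists; the function names are hypothetical, the first study's class sizes (1, 3, 25) are those quoted for Study #1 in the Monte Carlo section, and the second study's sizes are invented for illustration.

```python
def delta_covariance_block(class_sizes):
    """Sigma block for one study: deltas compare each size with the smallest."""
    sizes = sorted(class_sizes)
    n1, others = sizes[0], sizes[1:]
    # Diagonal: 1/n1 + 1/nj; off-diagonal: 1/n1 (shared smallest class).
    return [[1 / n1 + (1 / nj if r == c else 0.0) for c in range(len(others))]
            for r, nj in enumerate(others)]


def block_diagonal(blocks):
    """Place the per-study blocks along the diagonal; zeros elsewhere."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for block in blocks:
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                out[offset + r][offset + c] = value
        offset += len(block)
    return out


sigma = block_diagonal([
    delta_covariance_block([1, 3, 25]),        # two deltas (J = 3)
    delta_covariance_block([10, 20, 30, 40]),  # three deltas (J = 4), hypothetical
])
for row in sigma:
    print([round(v, 3) for v in row])
```

The zeros off the blocks encode the assumption that deltas from different studies are independent.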



The above argument appears to be mathematically complete and
appropriate to the inferential problems of fitting and testing the logarithmic
model. A Monte Carlo study is not strictly required, failing the discovery
of some flaws in the mathematics, but it will be useful to check the validity
of the GLS procedure while carrying out a Monte Carlo study to check the
usefulness of the OLS and JK solutions. One knows a priori that the OLS and
JK confidence intervals do not have complete mathematical justifications;
the OLS intervals are likely to be uselessly inexact and, as always, the
accuracy of the approximation upon which the JK intervals are based must be
checked.

In the following section, the results of a Monte Carlo simulation are
presented in which the accuracy of confidence intervals constructed by the
GLS, OLS and JK methods is compared.

Monte Carlo Study

A Monte Carlo study was conducted to check the validity of OLS, GLS and
JK confidence intervals. The structure of the simulation (i.e., number of
studies, number of class-sizes compared, and the sizes of classes compared)
was chosen to duplicate exactly the data set in the meta-analysis of
class-size and achievement (Glass and Smith, 1979). The data set structure is as
follows:

Table 6.13

Structure of the Data Set Used in the Monte Carlo Study

For example, in Study #1, three class-sizes were compared: 1, 3
and 25. This study gives rise to two values of delta: $\Delta_{1-3}$ and $\Delta_{1-25}$. In
Study #4, two class-sizes are compared, yielding a single delta.

Given the above data structure, there are only two parameters of the
logarithmic model that need to be specified: the value of $\beta$ and the error
variance $\sigma^2$ (N.B.: this error variance describes error in observations of
individuals; it is not the same as the error $\varepsilon$). The value of $\beta$ can be
specified without restriction; in the simulations, values of .25, .50 and 1.00
were used. The error variance, $\sigma^2$, is specified in a round-about way by first
specifying a value for the linear correlation between z and $\log(n_1/n_2)$ in
the model

$$z = \beta \log(n_1/n_2) + e,$$

and then solving for $\sigma$ assuming that z has unit variance. In the simulations
reported here, the linear correlation, $\rho$, between z and $\log(n_1/n_2)$ was
taken to be either .65 or .85. Hence, the corresponding error standard
deviations equal

$$\sigma = \sqrt{1 - .65^2} \approx 0.76;$$

$$\sigma = \sqrt{1 - .85^2} \approx 0.53.$$
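The conversion from the specified correlation to the error term is a one-line computation. The sketch below assumes, as the text does, that z has unit variance, so that the error standard deviation is the square root of 1 - ρ²; the function name `error_sd` is illustrative.

```python
import math


def error_sd(rho):
    """Error standard deviation implied by correlation rho when Var(z) = 1."""
    return math.sqrt(1.0 - rho ** 2)


for rho in (0.65, 0.85):
    print(rho, round(error_sd(rho), 2))
```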

The steps in the simulation proceeded as follows:

Step 1. Having specified values of $n_1$, $n_2$, $\beta$ and $\sigma$ (say, $\beta$ = .5 and
        $\rho$ = .85, which fixes $\sigma$), scores are generated according to the
        model

        $$z = \beta \log(n_1/n_2) + e.$$

Step 2. Deltas are calculated via

        $$\Delta_{n_1-n_2} = \bar{z}_{n_1} - \bar{z}_{n_2}.$$

Step 3. In this way, all the deltas for the 14-study data set specified
        above are calculated.

Step 4. The ordinary least-squares (OLS) estimate of $\beta$ is calculated
        in the usual fashion from the 30 deltas that arise from Table 6.13.
        The 1 - $\alpha$ confidence interval on $\beta$ is calculated from

        $$\hat{\beta} \pm t_{1-\alpha/2;\,29}\; \hat{\sigma}_{\hat{\beta}}.$$

Step 5. The jackknife (JK) confidence interval on $\beta$ is calculated by
        means of the 14 pseudo-values arising from the data structure
        in Table 6.13 and then by means of the formula

        $$\bar{\theta}_{\cdot} \pm t_{1-\alpha/2;\,df}\; s_\theta/\sqrt{n}.$$

Step 6. The generalized least-squares (GLS) confidence interval on
        $\beta$ is calculated via

        $$\hat{\beta} \pm t_{1-\alpha/2;\,N-1}\; \hat{\sigma}_{\hat{\beta}},$$

        where the estimates are given in formulas (19), (20) and (21) above.

Step 7. Each of the three intervals was calculated for each single
        simulation and it was recorded whether the 90 percent, 95
        percent and 99 percent confidence intervals captured the
        true value of $\beta$. The simulation was repeated 1,000 times and
        the proportions of intervals capturing the parameter were
        counted.
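The simulation steps can be sketched in a reduced form. Everything specific here is an assumption made for illustration: the `STUDIES` structure is hypothetical (it is not the structure of Table 6.13), class means are drawn directly rather than from individual scores, only the OLS and JK intervals are computed (the GLS interval is omitted to keep the sketch short), and the critical values 2.262 and 2.571 are the .975 points of Student's t for 9 and 5 degrees of freedom, matching the 10 deltas and 6 studies of this toy structure.

```python
import math
import random
import statistics

# Hypothetical class sizes compared within each study (NOT Table 6.13).
STUDIES = [[1, 3, 25], [10, 20], [15, 30, 45, 60], [20, 28], [5, 35], [12, 24, 36]]


def simulate_deltas(beta, sigma, rng):
    """One (x, delta) list per study, with x = log(n1/nj), delta = mean_1 - mean_j."""
    data = []
    for sizes in STUDIES:
        means = [beta * math.log(n) + rng.gauss(0.0, sigma / math.sqrt(n))
                 for n in sizes]
        data.append([(math.log(sizes[0] / n), means[0] - m)
                     for n, m in zip(sizes[1:], means[1:])])
    return data


def ols_beta(pairs):
    return sum(x * d for x, d in pairs) / sum(x * x for x, _ in pairs)


def ols_interval(data, t_crit):
    pairs = [p for study in data for p in study]
    b = ols_beta(pairs)
    sxx = sum(x * x for x, _ in pairs)
    resid_var = sum((d - b * x) ** 2 for x, d in pairs) / (len(pairs) - 1)
    half = t_crit * math.sqrt(resid_var / sxx)
    return b - half, b + half


def jackknife_interval(data, t_crit):
    k = len(data)
    b_all = ols_beta([p for study in data for p in study])
    pseudo = []
    for i in range(k):  # delete one whole study at a time
        kept = [p for j, study in enumerate(data) if j != i for p in study]
        pseudo.append(k * b_all - (k - 1) * ols_beta(kept))
    half = t_crit * statistics.stdev(pseudo) / math.sqrt(k)
    center = statistics.fmean(pseudo)
    return center - half, center + half


def coverage(beta=0.5, sigma=0.76, reps=1000, seed=0):
    """Proportion of nominal 95% intervals that capture the true beta."""
    rng = random.Random(seed)
    hits = {"OLS": 0, "JK": 0}
    for _ in range(reps):
        data = simulate_deltas(beta, sigma, rng)
        lo, hi = ols_interval(data, t_crit=2.262)
        hits["OLS"] += lo <= beta <= hi
        lo, hi = jackknife_interval(data, t_crit=2.571)
        hits["JK"] += lo <= beta <= hi
    return {name: count / reps for name, count in hits.items()}


print(coverage(reps=200))
```

Because the structure and sample sizes differ from those of the original study, the coverage proportions printed by this sketch are not expected to match the tables that follow.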

The results appear in the following table for the 90 percent confidence
coefficient. The results for the 95 percent and 99 percent confidence
intervals appear in Tables 6.15 and 6.16.
Table 6.14

Empirical Confidence Coefficients for True $\alpha = .10$ for
Ordinary Least-squares (OLS), Jackknife (JK) and Generalized
Least-squares (GLS) Confidence Intervals

  $\rho$    $\beta$    Method of      Empirical Confidence
                       Estimation     Coefficient

   .65       .25       OLS            .678
                       JK             .857
                       GLS            .900

             .50       OLS            .546
                       JK             .845
                       GLS            .909

            1.00       OLS            .641
                       JK             .857
                       GLS            .910

   .85       .25       OLS            .553
                       JK             .866
                       GLS            .894

             .50       OLS            .547
                       JK             .852
                       GLS            .906

            1.00       OLS            .642
                       JK             .654
                       GLS            .897
Table 6.15

Empirical Confidence Coefficients for True $\alpha = .05$ for
Ordinary Least-squares (OLS), Jackknife (JK) and Generalized
Least-squares (GLS) Confidence Intervals

  $\rho$    $\beta$    Method of      Empirical Confidence
                       Estimation     Coefficient

   .65       .25       OLS            .785
                       JK             .917
                       GLS            .955

             .50       OLS            .740
                       JK             .905
                       GLS            .949

            1.00       OLS            .744
                       JK             .912
                       GLS            --

   .85       .25       OLS            .742
                       JK             .914
                       GLS            .947

             .50       OLS            .771
                       JK             .929
                       GLS            .947

            1.00       OLS            .734
                       JK             .914
                       GLS            .943
Table 6.16

Empirical Confidence Coefficients for True $\alpha = .01$ for
Ordinary Least-squares (OLS), Jackknife (JK) and Generalized
Least-squares (GLS) Confidence Intervals

  $\rho$    $\beta$    Method of      Empirical Confidence
                       Estimation     Coefficient

   .65       .25       OLS            .856
                       JK             .955
                       GLS            .993

             .50       OLS            .871
                       JK             .973
                       GLS            .987

            1.00       OLS            .876
                       JK             .959
                       GLS            .988

   .85       .25       OLS            .879
                       JK             .973
                       GLS            .994

             .50       OLS            .875
                       JK             .9?1
                       GLS            .991

            1.00       OLS            .859
                       JK             .958
                       GLS            .983
The results in Tables 6.14 through 6.16 are remarkably similar and the
findings are clear. The GLS method is accurate; it yields the confidence
coefficient that one expects to have when referencing the $1 - \alpha/2$
percentiles of the proper t-distribution. The empirical and theoretical
confidence coefficients were never more than .01 units discrepant, a
discrepancy well within the bounds of sampling error for 1,000 cases, as it
must be since the GLS solution is mathematically correct. By comparison, the
OLS confidence intervals were grossly in error. For example, with
$\beta = 1.0$ and $\rho = .85$, the nominal 90 percent OLS confidence interval
around $\hat{\beta}$ has only .642 probability of capturing the parameter
value of 1.0, an error in the expected confidence coefficient of roughly
one-third.
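The phrase "sampling error for 1,000 cases" can be made concrete with a short sketch. It assumes that each of the 1,000 replications is an independent success-or-failure trial, so that the empirical coefficient has the usual binomial standard error; the function name is ours.

```python
import math

def coverage_se(nominal: float, reps: int = 1000) -> float:
    """Binomial standard error of an empirical coverage proportion."""
    return math.sqrt(nominal * (1.0 - nominal) / reps)

for nominal in (0.90, 0.95, 0.99):
    print(f"nominal {nominal:.2f}: standard error {coverage_se(nominal):.4f}")

# How far the OLS value quoted in the text lies from the nominal .90,
# measured in standard errors.
print(f"OLS gap: {(0.90 - 0.642) / coverage_se(0.90):.1f} standard errors")
```

On this reckoning a discrepancy of .01 at the 90 percent level is about one standard error, whereas the OLS value of .642 lies more than 25 standard errors below the nominal coefficient.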

The JK confidence coefficients are more accurate than the OLS coefficients
but they are probably unacceptably discrepant from theoretical values, in
absolute terms, and they are clearly less accurate than the GLS confidence
intervals. For example, for $\beta = .50$ and $\rho = .65$, the nominal
90 percent JK interval has an actual confidence coefficient of 84.5%, an
error of over 5 percentage points, whereas the GLS interval, as expected,
shows an empirical confidence coefficient equal (within sampling error) to
the theoretical value.



A Monte Carlo simulation showed the generalized least-squares confidence
intervals on $\beta$ of the logarithmic model to be accurate, according to
theory. The ordinary least-squares confidence intervals proved to be
grossly inaccurate and unacceptable, victims of the non-independence of
the $\Delta$'s from which the logarithmic model is fitted. The jackknife
confidence intervals (although not as inaccurate as the OLS intervals and
although possibly capable of being improved by proper normalizing
transformations yet to be discovered) were less accurate than the GLS
intervals.

The method of generalized least-squares is an accurate method of interval
estimation of $\beta$ in the logarithmic model which finds frequent
application in problems of meta-analysis.
CHAPTER SEVEN
AN EVALUATION OF META-ANALYSIS

The approach to research integration referred to as "meta-analysis"
is nothing more than the attitude of data analysis applied to quantitative
summaries of individual experiments. By recording the properties of studies
and their findings in quantitative terms, the meta-analysis of research
invites one who would integrate numerous and diverse findings to apply the
full power of statistical methods to the task. Thus it is not a technique;
rather it is a perspective that uses many techniques of measurement and
statistical analysis.

A tenet of evaluation theory is that self-assessment is always more
suspect than assessment by a neutral party. There is a tone of false
promise in professing to criticize an endeavor in which one has invested
himself heavily. Although we cannot promise to deal with the strengths
and weaknesses of the meta-analysis approach with an even hand, we can
assure the reader that most of the objections raised against the procedure
by critics of earlier applications are recorded and discussed below.
Applications of meta-analysis to research in psychotherapy, school
class-size, special education and other problems have produced many
technical criticisms. Among the persons commenting on meta-analysis are the
following: Mansfield & Busse (1977), Bandura (1978), Eysenck (1978a),
Gallo (1978), Jackson (1978), Paul (1978), Presby (1978), Walberg (1978),
Anonymous (1979), G41 Van (1979), Rimland (1979), Simpson (1980),
Eysenck (1978b), Shapiro (1977), Cook and Leviton (1980), Hunter (1979),
Roid, Brodsky and Bigelow (1979).
1. The Apples and Oranges Problem

   It is illogical to compare "different" studies, i.e., studies
   done with different measuring techniques, different types of
   persons, and the like.

2. Use of Data From "Poor" Studies

   Meta-analysis advocates low standards of quality for research.
   It accepts uncritically the findings from studies that are
   poorly designed or are otherwise of low quality. Aggregated
   conclusions should only be based on the findings of "good"
   studies.

3. Selection Bias in Reported Research

   Meta-analysis is dependent on the findings that researchers
   report. Its findings will be biased if, as is surely true,
   there are systematic differences among the results of research
   that appear in journals vs. books vs. theses vs. unpublished
   papers.

4. Lumpy (Non-independent) Data

   Meta-analyses are conducted on large data sets in which multiple
   results are derived from the same study; this renders the data
   non-independent and gives one a mistaken impression of the
   reliability of the results.

In the remainder of this section, these criticisms will be addressed
with counterarguments and data accumulated from several extant meta-analyses.
Criticism #1 - The Apples and Oranges Problem. The meta-analysis
approach to research integration mixes apples and oranges; it makes no
sense to integrate the findings of different studies.

The worry is often encountered that in combining or integrating
studies, one is forcing incommensurable studies together, or trying to
make different studies answer the same question, or "mixing apples and
oranges." Implicit in this concern is the belief that only studies that
are the same in certain respects can be aggregated. "A study's dependent
variables and those independent variables which are measured must
be measured in the same way as, or in a way subject to a conversion into,
those employed in the rest of the studies" (Light and Smith, 1971, p. 449).
This thesis should be clarified in at least two ways: "same" is not defined,
and the respects in which comparable studies must be the same are
unspecified. The claim that only studies which are the same in all respects
can be compared is self-contradictory; there is no need to compare them since
they would obviously have the same findings within statistical error. The
only studies which need to be compared or integrated are different studies.
Yet it is intuitively clear that some differences among studies are so large
or critical that no one is interested in their integration. What, for
example, is to be made of study #1 which demonstrates the effectiveness
of disulfiram in the treatment of alcoholism and study #2 which demonstrates
the benefits of motorcycle helmet laws? Not much, I suppose. But it
hardly follows that the integration of study #1 on lysergide treatment of
alcoholism and study #2 on "controlled drinking" is meaningless; one is
understandably concerned with which treatment has a greater cure rate.
Is the essential difference between the two examples that in the former
case the problems addressed by the studies are different but the problem
is the same in the latter example? "Problem" is no better defined than
"study" or "findings," and invoking the word clarifies little. It is
easy to imagine the Secretary for Health comparing fifty studies on
alcoholism treatment with fifty studies on drug addiction treatment or a
hundred studies on the treatment of obesity. If the two former groups
of studies are negative and the latter is positive, the Secretary may
decide to fund only obesity treatment centers. From the Secretary's
point of view, the problem is public health, not simply alcoholism or
drug addiction treatment.

Suppose that a researcher wished to integrate existing studies on
computer-assisted instruction (CAI) and cross-age tutoring (CAT) to
obtain some notion of their relative effectiveness. That studies #1 and
#2 on CAI used different standardized achievement tests to measure progress
in mathematics is a difference that should cause little concern, considering
the basic similarity of most standardized achievement tests. He who
would object to integrating the findings from these two studies must face
a succession of difficult questions which begin with whether he will
accept as comparable two studies using different forms of the same test or
whether he will accept as equal two average scores which were achieved by
different patterns of item responses to the same form of the same test.

Imagine further that of 100 CAI studies, 75 were in math and 25 in
science, whereas of the 100 CAT studies, 25 were in math and 75 were in
science. Are the aggregated data on effectiveness from 100 studies each
of CAI and CAT meaningfully comparable? It depends entirely on the exact
form of the question being addressed. If CAI is naturally much more
frequently applied to math instruction than to science (and vice versa
for CAT), then the simple aggregation of effectiveness measures may most
meaningfully answer the question of what benefits could be expected by a
typical school from installing CAI (and using it in the natural manner,
which means three times more extensively in math than in science) instead
of instigating CAT. If, however, one were more interested in the question
of whether CAI was a more effective medium than CAT, then such a comparison
ought not to be confounded with problems of the difficulties of learning
math versus science. In these circumstances, a straightforward aggregation
of the findings in each set of 100 studies would not be most meaningful.
To compare the media independently of subject taught, one could calculate
effectiveness measures separately for math and science within either CAI
or CAT. Then total effectiveness measures for CAI and CAT would be
constructed by some appropriate method of proportional weighting.
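The contrast between simple aggregation and proportional weighting can be sketched as follows. The study counts (75/25 and 25/75) come from the text; the effect sizes and the equal subject weights are hypothetical values chosen only for illustration, and the function names are ours.

```python
def pooled_mean(effects, counts):
    """Study-count-weighted mean effect (simple aggregation)."""
    total = sum(counts.values())
    return sum(effects[s] * counts[s] for s in counts) / total

def standardized_mean(effects, weights):
    """Mean effect under one common set of subject weights."""
    return sum(effects[s] * weights[s] for s in weights)

# Study counts follow the text; the effect sizes are hypothetical.
counts = {
    "CAI": {"math": 75, "science": 25},
    "CAT": {"math": 25, "science": 75},
}
effects = {
    "CAI": {"math": 0.50, "science": 0.30},
    "CAT": {"math": 0.60, "science": 0.40},
}
equal_weights = {"math": 0.5, "science": 0.5}  # one possible weighting

for medium in ("CAI", "CAT"):
    simple = pooled_mean(effects[medium], counts[medium])
    adjusted = standardized_mean(effects[medium], equal_weights)
    print(f"{medium}: simple = {simple:.2f}, subject-adjusted = {adjusted:.2f}")
```

With these made-up numbers the simple aggregates for the two media coincide, while the subject-adjusted means differ, because the simple aggregate mixes the medium with the subject in which it happened to be studied. Equal weights are only one choice; weights proportional to the subject mix of interest would serve equally well.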

There is another respect in which critics who dismiss meta-analysis as
meaningless because it mixes apples and oranges are inconsistent.
These same critics, researchers themselves, habitually perform data
analyses in their own research in which they lump together (average or
otherwise aggregate in analyses of variance, t-tests and whatever) data
from different persons. These persons are as different and as much like
apples and oranges in their way as studies are different from each other,
yet the same critics who object to pooling the findings of studies 1, 2,
..., 10 see nothing at all objectionable in pooling the results from
persons 1, 2, ..., 100 in their own research. An inconsistency of no
small order must be acknowledged at this point, or else the critic of
meta-analysis must argue convincingly that the two kinds of aggregating
identified are qualitatively different; and he should specify how they
are different and why it matters, which will necessarily entail presenting
empirical evidence to demonstrate that studies using different populations,
measuring instruments, data analyses, etc. are fundamentally incommensurable.
(The ironic dilemma posed here is that such an empirical demonstration
would be of itself an analysis of exactly the type which we have referred
to as a "meta-analysis.")

Criticism #2 - Use of Data From "Poor" Studies. The meta-analysis approach
advocates low standards of judgment of the quality of studies.

Although Eysenck (1978) saw us as "advocating" low standards of
research quality, other critics have viewed us merely as being incapable
of telling the difference between "good" and "bad" studies. We have been
accused of relying on undiscriminating volume of data rather than on
quality of design and evidence. In the academic wars waged over the
questions of the benefits of psychotherapy, the judgment of "quality of
design and evidence" has usually been the ad hoc impeaching on methodological
grounds of the studies of one's enemies.

Somewhere in the history of the social sciences, research criticism
took an unhealthy turn. It became confused with research design. The
critic often reads a published study and second-guesses the aspects of
measurement and analysis that should have been anticipated by the researcher.
If a study "fails" on a sufficient number of these criteria, or if it
fails to meet conditions of which the critic is particularly fond, the
study is discounted or eliminated completely from consideration. Research
design has a logic of its own, but it is not a logic appropriate to
research integration. The researcher does not want to perform a study
deficient in some aspect of measurement or analysis, but it hardly follows
that after a less-than-perfect study has been done, its findings should
not be considered. A logic of research integration could lead to a
description of design and analysis features and study of their covariance
with research findings. If, for example, the covariance is quite small
between the size of an experimental effect and whether or not subjects
were volunteers, then the force of the criticism that some experiments
used volunteers is clearly diminished.
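A minimal sketch of such a check follows. It correlates a coded design feature (1 if the subjects were volunteers, 0 otherwise) with effect size across studies. The eight coded studies are hypothetical and exist only to show the calculation; the function name is ours.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical coded studies: 1 = volunteer subjects, 0 = non-volunteers.
volunteers = [1, 1, 1, 0, 0, 0, 1, 0]
effect_sizes = [0.52, 0.71, 0.40, 0.55, 0.68, 0.43, 0.60, 0.58]

r = pearson_r(volunteers, effect_sizes)
print(f"correlation of volunteer status with effect size: {r:.3f}")
```

A correlation near zero, as in this contrived example, is the situation in which the objection to volunteer samples loses its force; a substantial correlation would instead argue for reporting the two kinds of study separately.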

Our early work on the effects of psychotherapy (Smith and Glass,
1977) never strayed far from a sensitivity to design and methods in the
studies integrated. However, across the field of psychotherapy outcome
evaluation, there was basically no correlation between the "quality" (in
the sense of Campbell and Stanley, 1966, and others) of the design and the
size of psychotherapy effect (Smith and Glass, 1977, p. 758, Table 4). Thus
any distinctions between "good" and "bad" studies would leave the overall
picture unchanged, a fact that should be clear to anyone who understands
what the absence of correlation implies. No purpose would have been
served by reporting results separately for "good" and "bad" studies since
they would have been essentially the same. In a meta-analysis of educational
research on the effect of class-size on achievement, Glass and Smith (1979)
found that quality of research design (essentially the degree of control
exercised over the assignment of pupils to classes) was the highest
correlate of effects. The sensible course was elected, and results were
presented only for the studies in which careful experimental control was
exercised.

An early attempt at meta-analysis was characterized somewhat cynically
by a critic as follows: "Although no single study was well enough done to
prove that psychotherapy is effective, when you put all these bad studies
together, they show beyond doubt that therapy works." This skeptical
characterization with its paradoxical ring is a central thesis of research
integration. In fact, many weak studies can add up to a strong conclusion.
Suppose that, in a group of 100 studies, studies 1-10 are weak in
representative sampling but strong in other respects; studies 11-20 are weak
in measurement but otherwise strong; studies 21-30 are weak in internal
validity only; studies 31-40 are weak only in data analysis; and so on. But
imagine also that all 100 studies are somewhat similar in that they show a
superiority of the experimental over the control group. The critic who
maintains that the total collection of studies does not support strongly the
conclusion of treatment efficacy is forced to invoke an explanation of
multiple causality (i.e., the observed difference can be caused either by
this particular measurement flaw, or this particular design flaw, or this
particular analysis flaw, or ...). The number of multiple causes which must
be invoked to counter the explanation of treatment efficacy can be
embarrassingly large for even a few dozen studies. Indeed, the
multiple-defects explanation will soon grow into a conspiracy theory or else
collapse under its own weight. Respect for parsimony and good sense demands
an acceptance of the notion that imperfect studies can converge on a true
conclusion.

An important part of every meta-analysis with which we have been
associated has been the recording of methodological weaknesses in the
original studies and the examination of their covariance with study findings.
Thus, the influence of "study quality" on findings has been regarded
consistently as an empirical a posteriori question, not an a priori
matter of opinion or judgment used in excluding large numbers of studies
from consideration. But, a critic once asked us, "Why do you study the
difference in the findings of 'good' vs. 'bad' studies? If you found a
difference, wouldn't you reject the 'bad' studies? And if you found no
difference, wouldn't the findings of the 'good' studies be the same as those
for all studies regardless of quality?" The dilemma was neatly posed, and we
hope the answer is comprehensible. Surely, the "good" studies (i.e., those
with excellent controls and sophisticated technology) are to be believed when
a conflict is observed between findings of good and poor studies (cf. Glass
and Smith, 1979). However, if "good" and "poor" studies do not differ
greatly in their findings, a large data base (all studies regardless of
quality) is much to be preferred over a small data base (only the "good"
studies). The larger data base can be more readily subdivided to answer
specific sub-questions that are inevitably provoked by the answers to the
general questions (e.g., "But are behavioral therapies superior to cognitive
therapies for children with low I.Q.?"). The smaller data base of "good"
studies only is likely to have too few instances to address many
sub-questions. Moreover, even when the results of "good" and "bad" studies
differ, even the bad or not-so-bad studies can be informative; for suppose
that six studies of quality "10" on a ten-point scale show a correlation of
X and Y of .70 on the average, and that twelve studies of quality "9" show
an r of .65, studies of quality "8" an r of .60, and so on down to quality
"1" an r of .10, say. This pattern is far more informative and lends greater
credence to an r of .70 for six studies of top quality than would the
results of the six studies in isolation from all others.

The covariation of research quality with results is, then, an
empirical matter of central concern in meta-analysis, as well as being of
interest to research methodologists who find meta-analysis too much to
swallow. Fortunately, we have several thousand data that can inform us on
the general question.

In Table 7.1 appears a summary of the differences in results among
studies of varying research quality for twelve different meta-analyses.
Each meta-analysis was performed on a literature of comparative experimental
findings. The basic unit of measurement for the meta-analysis was the
effect size, ES, and in each instance it was defined so that positive values
indicated findings in accord with the favored hypothesis of the field in
question (e.g., a positive ES in Hartley's meta-analysis of computer assisted
math instruction indicated a superiority of CAI over traditional teaching).
In each meta-analysis, the rating of High, Medium, or Low research quality
was primarily an assessment of internal validity of the experiment (Campbell
and Stanley, 1966).

If Table 7.1 achieves nothing else, it ought to be, at the very least,
an effective antidote to rampant a priorism on the matter of which studies
should be admitted as evidence in deciding research questions. Some of the
meta-analyses in Table 7.1 show a relationship between design quality and
findings and others do not. But in those analyses with substantial numbers of
cases, the differences in size of average experimental effects between High
validity and Low validity experiments are surprisingly small. The only notable
exceptions to this trend in the entire table are Hartley's ('77) tutoring
analysis, Smith ('80) and Carlberg's ('79) resource room analysis; but in
each of these instances, as just suggested, the large deviations are probably
merely the consequence of small n's in particular categories. As a general
rule, there is seldom much more than one-tenth standard deviation difference
between average effects for High validity and Low validity experiments.
Table 7.1

Relationship Between Research Quality (Internal Validity) and
Findings in 12 Meta-Analyses of Experimental Literatures

                                                    Relationship Between Internal Validity
                                                    and Average Experimental Effect Size
Investigator(s)      Topic                                 High     Medium      Low

Hartley ('77)        Computer-based instruction     n:       11        55        23
                                                   ES:     .311      .389      .503
                     Tutoring                       n:       52        12         9
                                                   ES:     .584      .305     1.066
Kulik, Kulik &       Individual instruction         n:       22        --        22
  Cohen ('79)                                      ES:     .409        --      .?04
Smith ('80a)         Sex bias in psychotherapy      n:       30        26         4
                                                   ES:     -.18      -.01       .77
Smith ('80b)         Effects of aesthetic educ.     n:       84        48        --
                       on basic skills             ES:      .53       .52       .69
Carlberg ('79)       Spec. ed. room placement       n:       83       187        52
                       vs. reg. room placement     ES:     -.19      -.11       .02
                     Resource room placement        n:        3        31         5
                       vs. reg. room placement     ES:     1.13       .12       .56
                     Spec. educ. intervention       n:       40        --        35
                       vs. classroom treatment     ES:      .19       .27       .53
Miller ('78)         Drug therapy for psych.        n:      297        16        37
                       disorders                   ES:      .48       .54       .64
Hearold ('79)        Effects of TV on               n:      176       175       176
                       anti-social behav.          ES:      .33       .30       .27
                     Effects of TV on               n:       35        35        35
                       "pro-social" behav.         ES:      .59       .53       .67

SUBTOTALS                                           n:      833       667       515
                                                   ES:      .36       .21       .43

Smith, Glass &       Psychotherapy                  n:     1157       378       224
  Miller ('80)                                     ES:      .82       .75       .68

TOTALS                                              n:     1990      1045       739
                                                   ES:      .63       .40       .51
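The TOTALS row of Table 7.1 can be checked against the SUBTOTALS and psychotherapy rows with a short sketch. It assumes that the totals are n-weighted averages of the two rows above them, which the table does not state explicitly; the function name is ours.

```python
def combine(groups):
    """Pool (n, mean) pairs into an overall n and an n-weighted mean."""
    total_n = sum(n for n, _ in groups)
    return total_n, sum(n * mean for n, mean in groups) / total_n

# (n, mean effect size) from the SUBTOTALS and psychotherapy rows of Table 7.1.
rows = {
    "High": [(833, 0.36), (1157, 0.82)],
    "Medium": [(667, 0.21), (378, 0.75)],
    "Low": [(515, 0.43), (224, 0.68)],
}

for level, groups in rows.items():
    n, mean = combine(groups)
    print(f"{level}: n = {n}, mean ES = {mean:.3f}")
```

Under that assumption the pooled n's match the TOTALS row exactly and the pooled means agree with it to within about .01, the rounding error one would expect from two-decimal inputs.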
Our experience with meta-analyses of experiments was matched by Yin,
Bingham and Heald (1976) in their study of the relationship between case study
quality and findings. Yin and his colleagues collected 140 case studies on
technological innovations, every study they could find that appeared after
1965. They devised four criteria for judging the quality of the studies:
1) presence of operational measures of innovative device and outcomes,
2) presence of some relevant research design, 3) overall adequacy of evidence
in relation to conclusions, and 4) adequacy of evidence in relation to each
stated outcome. They correlated research quality, so defined, with study
outcomes and concluded:

    "To the extent that one objective of our investigation was to
    examine the widest possible range of reported innovative experiences,
    there was thus strong reason not to discard the lower quality studies.
    At the same time, the general lack of relationship between quality and
    the outcomes of the innovative experience suggested that the inclusion
    of lower quality studies would not affect the overall conclusions to be
    drawn from the review." (Yin, Bingham and Heald, 1976, pp. 153-4)

In an earlier study (Yin and Yates, 1975), the investigators did observe
an association between research quality and findings, just as we see a
relationship in some literatures and not in others. Without thinking about
the matter further, one is tempted to ask why "poor quality" studies are
included in the first place if they'll only be retained provided they agree
in their findings with the high quality studies. If there were very large
numbers of both well-done and poorly-done studies on a question, the answer
would be clear: throw away the poorly-done studies and heed the message of
the high quality research. But the usual situation is that there exist
several studies, some of which are high quality, some average and some poor.




Suppose that of fifty experiments on the effects of jogging on life
expectancy, 25 are judged to be of poor design and execution, 15 are regarded
as moderately well done and 10 are well done. Suppose further that the average
effect (experimental vs. control group difference) is 2.86 years life
expectancy favoring the experimental group in the 10 best designed studies.
Should one base his opinion on the results of these 10 studies and ignore the
findings of the other forty? Let's press on and see. Suppose that the effects
shown by the 15 moderately well done and 25 poorly done experiments were
2.74 years and 2.60 years, respectively. These findings do, in fact, support
the finding of the less numerous well done studies and make it more credible.
Imagine contrariwise that the average effects for the moderately well done and
poorly done experiments were -0.?7 years and 8.65 years, respectively. Now the
finding of the ten well done experiments is placed in a context of chaotic
error and variability and it is more suspect. People reason and judge with the
help of complex patterns and contexts; scholars who are doctrinaire about
research quality when they integrate research studies ignore this fact. It is
precisely this fact that was ignored in a widely publicized critique of our
meta-analysis of the school class-size and achievement relationship
(Educational Research Service, 1980).
Criticism #3. Selection Bias in Reported Research.
Meta-analysis is dependent on the findings that researchers report.
Its findings will be biased if, as is surely true, there are systematic
differences among the results of research that appear in journals vs. books
vs. theses vs. unpublished papers.

The findings of a dozen meta-analyses can be used to inform us on the
severity of one aspect of this criticism. Several investigators working on
the integration of experimental literatures compared the effects revealed
by experiments depending on whether they were published in journals, books,
doctoral or master's theses, or not published at all. The results are
tabulated as Table 7.2.

The findings in Table 7.2 are fairly consistent. In every one of the
ten instances in which the comparison can be made, the average experimental
effect from studies published in journals is larger than the corresponding
effect estimated from theses and dissertations. That is, if one integrates
only "published" (meaning journal published) studies, the impression of
support for the favored hypothesis is artificially enhanced over what would
be seen if the entire literature were integrated (i.e., journals, books and
dissertations). The bias in the journal literature relative to the bias
in the dissertation literature is not inconsiderable. The mean effect size
for journals is .64 as compared with .48 for the dissertation literature;
hence, the bias is of the order of [(.64 - .48)/.48] 100% = 33%. Thus,
findings reported in journals are, on the average, one-third larger (about
.16 standard deviation units) and thus more favorably disposed toward the
favored hypotheses of the investigators than findings reported in theses or
dissertations.

                                  Table 7.2

          Relationship Between Source of Publication and Findings
              in 12 Meta-Analyses of Experimental Literatures

                                                       Source of Publication
Investigator(s)      Topic                          Journal     Book   Thesis  Unpubl.

Kavale ('79)         Psycholinguistic          n:        13                16      [?]
                     training                  Δ:       .50               .30      .37

Hartley ('77)        Computer-based            n:        34                13       34
                     instruc.                  Δ:       .35               .28      .54

                     Tutoring                  n:         9                47       17
                                               Δ:       .77               .40     1.05

Rosenthal ('76)      Experimenter bias         n:        25                50
                                               Δ:      1.02               .74

Smith ('80a)         Sex bias in               n:        28     1[?]       32
                     psychotherapy             Δ:       .22              -.24

Smith ('80b)         Effects of aesthetics     n:     29[?]     4[?]      164       56
                     educ. on basic skills     Δ:      1.08               .48      .50

Carlberg ('79)       Spec. ed. room placement  n:       146       17       45      114
                     vs. reg. room placement   Δ:      -.09     -.01     -.15     -.14

                     Resource room plac.       n:        33     6[?]
                     vs. reg. room place.      Δ:       .32  -.09[?]

Miller ('79)         Drug therapy of           n:       336
                     psych. disorders          Δ:       .49

Hearold ('79)        Effects of T.V. on        n:       262      120       96       13
                     anti-social behav.        Δ:       .40      .14      .18  .23[?]

SUBTOTALS                                      n:      1025      177      473      268
                                               Δ:       .38      .18      .30      .27

Smith, Glass &       Psychotherapy             n:      1179       42      483       61
Miller ('80)                                   Δ:       .87      .80      .66     1.96

TOTAL                                          n:      2204      219      956      329
                                               Δ:       .64      .30      .48      .58

[[?] marks an entry whose reading or column is uncertain in the source copy.
One further row at the head of the continuation page survives only as
fragments (n = 4 and 21; Δ = .56) and cannot be assigned to an investigator
or column.]

Comparisons of average effect sizes among other sources of publication
are less clear, in part perhaps because of the ambiguity in labels such as
"unpublished" or "book." In four of six instances, journals gave more favorable
results than books. In four of eight instances, the average effect size for
journals was larger than for unpublished studies. Unpublished studies seemed
to divide along the following lines: one large group of old unpublished
studies containing unremarkable results that never caught anyone's attention,
and a small group of new studies circulating through the "invisible college"
while waiting to be published.

White (1976) also produced evidence of a selective publication effect
in his meta-analysis of the relationship between socio-economic status and
achievement. The average of 155 correlations published in books was .31;
38 r's in journals averaged .25, and 286 dissertation correlations between
achievement and SES showed an average of .20. This trend, toward weaker
relationships in dissertations than in journals, agrees with the trend
established above for various experimental literatures.

The compilation of results from various meta-analyses shows that there
is substance to the criticism that most disciplines show evidence of a
selection bias in what they publish. And the bias may be large in some
instances; Smith's (1980) meta-analysis of sex-bias in psychotherapy is
particularly relevant, as a final example. The very direction of the bias
was reversed between the dissertation literature and published journals
(from demonstrating a bias in favor of women in the thesis literature to a
bias against women in journals); that this reversal was in accord with
political ideologies that are presumed to control access to journals makes
the case even stronger that disciplines are prone to the temptation to
reward findings they approve of by publishing them in more prestigious places.
However, the existence of selective publication tendencies
is not in itself a cogent criticism of meta-analysis, which, after all, is
used here to demonstrate the existence and the magnitude of the phenomenon.
Indeed, the problem of selective publication cannot be dealt with adequately
in integrating a research literature except by meta-analytic means, i.e.,
by collecting all of the literature at the outset and analyzing it separately
by mode of publication.

There exists another factor with respect to which selection often
takes place during research integration, namely, the date on which the
studies were published. It is common for reviewers to restrict their attention
to a particular span of years and review only studies of that period, e.g.,
"This review will consider all laboratory studies on attention processes
published after 1950." The choice of dates is invariably arbitrary and
governed by convenience. It behooves us to inquire into the matter of
chronological trends in research findings.

In Table 7.3 appears a compilation of correlations between date of
publication and effect size from six meta-analyses of experimental
literatures.

                                  Table 7.3

          Correlation Between Date of Publication and Effect Size
              for Six Meta-Analyses of Experimental Literatures

                                                            Correlation Between
                                                            Date of Publication
Investigator(s)      Topic                                  and Effect Size

Kavale ('79)         Psycholinguistic training              r = -.01 (n = 25)
Hall ('78)           Gender effects in non-verbal coding    r =  .28 (n = 44)
Smith ('80)          Sex bias in psychotherapy              r =  .29 (n = 60)
Carlberg ('79)       Spec. ed. room placement               r =  .02 (n = 322)
                     Resource room placement                r =  .32 (n = 39)
                     Other spec. ed. intervention           r =  .08 (n = 156)
Miller ('78)         Drug treatment of psychological        r = -.01 (n = 358)
                     disorders
Smith, Glass         Psychotherapy                          r =  .07 (n = 1,764)
& Miller ('80)

The average of the eight correlations in Table 7.3 is +.13, indicating
that more recently published experiments show a slight tendency toward
larger effects than older studies. (The weighted average r, each r weighted
by the number of effect sizes in the particular meta-analysis, equals +.07.
The unweighted average is probably more sensible because it is not affected
by some meta-analyses arbitrarily having more data points.) Assuming a
correlation of +.13 between date of publication and effect size and some
reasonable parameters for the independent variable (Date) and the dependent
variable (Effect Size or Δ), a linear regression equation can be
constructed that relates date of publication to effect size:



$$\hat{\Delta} = .13\left(\frac{.67}{4}\right)\text{Date} + .70 - .13\left(\frac{.67}{4}\right)1970$$

The above equation contains some assumed values for the means and
standard deviations of Date and Δ:

    Variable     Mean     Standard Deviation

    Date         1970     4 years
    Δ            .70      .67

Substituting the dates 1965 and 1975, each about one standard
deviation away from the mean, into the regression equation gives:

$$\hat{\Delta}_{1965} = .59, \quad \text{and} \quad \hat{\Delta}_{1975} = .81.$$



These calculations indicate that the typical correlation between
date of publication and effect size (r = .13) implies that experiments
published in 1975 show a .22 average effect size advantage over experiments
published in 1965. This difference, amounting to [(.81 - .59)/.59] 100% = 37%,
is comparable to the difference in average effect size between journals and
theses. Thus the concerns about bias that applied in the case of selectivity
in publication outlet appear to apply with nearly equal force to the case of
selection of studies by date. It would seem ill-advised to begin the
integration of an empirical research literature by arbitrarily restricting
the studies considered to those published in refereed journals after 1960,
for example.
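
The averages and predictions in this passage can be retraced with a few lines of code. This is a sketch under stated assumptions, not the authors' own computation: the (r, n) pairs are transcribed from Table 7.3, the function name `predicted_effect` is hypothetical, and the standard deviation of Δ is taken as .67 because that value, with a 4-year standard deviation for Date, reproduces the reported predictions of .59 and .81.

```python
# (correlation, number of effect sizes) for the eight rows of Table 7.3.
correlations = [
    (-0.01, 25), (0.28, 44), (0.29, 60), (0.02, 322),
    (0.32, 39), (0.08, 156), (-0.01, 358), (0.07, 1764),
]

# Unweighted average: every correlation counts equally.
unweighted_r = sum(r for r, _ in correlations) / len(correlations)

# Weighted average: each r weighted by its number of effect sizes.
weighted_r = (sum(r * n for r, n in correlations)
              / sum(n for _, n in correlations))


def predicted_effect(date, r=0.13, mean_date=1970.0, sd_date=4.0,
                     mean_effect=0.70, sd_effect=0.67):
    """Simple linear regression of effect size on date of publication.

    The slope is r * (sd_effect / sd_date); the line passes through
    the point (mean_date, mean_effect). All defaults are assumed values.
    """
    slope = r * sd_effect / sd_date
    return mean_effect + slope * (date - mean_date)


early = round(predicted_effect(1965), 2)
late = round(predicted_effect(1975), 2)

print(f"unweighted r = {unweighted_r:.2f}, weighted r = {weighted_r:.2f}")
print(f"predicted effect: 1965 = {early:.2f}, 1975 = {late:.2f}")
print(f"advantage = {late - early:.2f} "
      f"({100 * (late - early) / early:.0f}% of the 1965 value)")
```

The large psychotherapy data set (n = 1,764, r = .07) dominates the weighted average, which is the reason given in the text for preferring the unweighted figure.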
Criticism #4. Lumpy (Non-Independent) Data

Meta-analyses are conducted on large data sets in which multiple
results are derived from the same study; this renders the data non-
independent and gives one a mistaken impression of the reliability of
the results.

Of all the technical criticisms of meta-analysis that have been
published in the last five years (and most of these criticisms are quite
off-the-mark and shallow), the reminder that meta-analyses are typically
carried out on lumpy sets of non-independent data is quite cogent. The
principal implication of this non-independence is a reduction in the
reliability of estimation of averages or of regression equations. For
example, if Study #1 gave effects .2, .2, .2 and .2 and Study #2 gave
effects .6, .5, and .6, one would have little reason to believe that he had
been informed seven times about the aggregate result in question; rather
the true "degrees of freedom" would seem to be somewhat closer to 2, the
number of studies, than to 7, the number of effects. A facile solution to
this problem of non-independence would be to average all findings within
a study up to the level of the study and proceed with a meta-analysis with
"studies" as the unit of analysis. No doubt there will be instances in
which this resolution of the problem will be satisfactory. But in most
instances, it is likely to obscure many important questions that can only
be addressed at the "within study" level of outcome variables, say.
The effect on accuracy of estimation of complex interdependencies in a
meta-analysis data base was addressed at the end of Chapter Six.
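
The two-study example above can be made concrete. The following is a minimal sketch, assuming nothing beyond the seven effects quoted in the text; the variable names are illustrative only. It contrasts pooling all effects as if they were independent with the "facile" alternative of averaging within each study first.

```python
from statistics import mean

# Effects reported by each study in the example above.
effects_by_study = {
    "Study 1": [0.2, 0.2, 0.2, 0.2],
    "Study 2": [0.6, 0.5, 0.6],
}

# Pooling every effect treats the 7 findings as 7 independent observations.
all_effects = [e for effects in effects_by_study.values() for e in effects]
pooled_mean = mean(all_effects)

# Averaging within each study first makes "studies" the unit of analysis,
# leaving 2 data points instead of 7.
study_means = {study: mean(effects)
               for study, effects in effects_by_study.items()}
study_level_mean = mean(study_means.values())

print(f"effects as units: n = {len(all_effects)}, mean = {pooled_mean:.2f}")
print(f"studies as units: n = {len(study_means)}, mean = {study_level_mean:.2f}")
```

Study 1 contributes four of the seven effects, so the pooled mean is pulled toward its value; with studies as the unit, each study counts once, at the cost of discarding the within-study detail that the text warns is often the point of interest.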
CONCLUSION

Of course, it is unclear what meta-analysis will contribute to the
progress of empirical research. One can imagine a future for research in the
social and behavioral sciences in which questions are so sharply put and
techniques so well standardized that studies would hardly need to be integrated
by merit of their consistent findings. But that future seems unlikely.
Research will probably continue to be an unorganized, decentralized,
non-standardized activity pursued simultaneously in dozens of places without
thought to how it will all fit together in the end. The need for formal
techniques of research integration like those we have illustrated will
probably grow. Whether future techniques will resemble these is uncertain, but
we suspect they will. The approach we call meta-analysis seems to be too
plainly reasonable to be false in any simple sense. Whether it will be
useful is a different matter.
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mmeU. (5) referred 
Card one

Column    Value    Information

33                 Group assignment of clients: (1) random, (2) matching,
                   (3) pretest equation, (4) convenience sample,
                   (5) other nonrandom
34                 Group assignment of therapists: (1) random, (2) matching,
                   (3) nonrandom, (4) single therapist, (5) not applicable
35                 Internal validity: (1) low, (2) medium, (3) high
36                 Number of threats to internal validity
37-38              Percentage mortality from treated group
39-40              Percentage mortality from comparison group
41                 If more than one therapy compared simultaneously against
                   control: (1) yes, (2) no
42                 Number of comparisons in this study
43                 Number of this comparison
44-45              Number of outcome measures within this comparison
46-47              Number of this outcome measure (the rest of the record
                   deals with this outcome measure)

                   Treatment

48-49              Type of treatment: (2) placebo, (3) psychodynamic,
                   (4) client-centered, (5) Adlerian, (6) gestalt,
                   (7) systematic desensitization, (8) cognitive/Ellis,
                   (9) cognitive/other, (10) transactional analysis,
                   (11) behavior modification, (12) eclectic/dynamic,
                   (13) eclectic/behavioral, (14) reality therapy,
                   (15) vocational/personal development counseling,
                   (16) cognitive behavioral, (18) implosion,
                   (19) hypnotherapy, (20) other
                   Label for therapy type
                   Proponent
50-52              List code for label
53-55              List code for proponent
56                 Confidence of classification: (1) low ... (5) high
57                 Class of therapy
58                 Superclass of therapy
59                 Type of comparison: (1) control, (2) placebo,
                   (3) second treatment
60                 Type of control group: (1) no treatment, (2) waiting list,
                   (3) intact group, (4) hospital maintenance, (5) other,
                   (blank) not control
61-62              Type of placebo list code
                   Label of placebo type
63-65              Second treatment type
66                 Allegiance of E to therapy compared: (1) yes, (2) no,
                   (3) unknown
67                 Modality: (1) individual, (2) group, (3) family,
                   (4) mixed, (5) automated, (6) other, (7) unknown
68-69              Location of treatment: (1) school, (2) hospital,
                   (3) mental health center, (4) other clinic,
                   (5) other outpatient, (6) private, (7) other,
                   (8) unknown, (9) college mental health facility,
                   (10) prison, (11) residential facility
70-72              Duration of therapy in hours
73-75              Duration of treatment in weeks
75-77              Number of therapists
78-79              Experience of therapists in years
Card two

Column    Value    Information

1-5                Study ID
6                  Running comparison number
7-8                Running measure number
9                  Running record number: punch 2 for card 2
10-12              Sample size for treatment group
13-15              Sample size for comparison group
16-17              Outcome type: (1) fear/anxiety, (2) self-esteem,
                   (3) test measures and ratings of global adjustment,
                   (4) life indicators of adjustment, (5) personality
                   traits, (6) emotional/somatic disorders, (7) addiction,
                   (8) sociopathic behaviors, (9) social behaviors,
                   (10) work/school achievement, (11) vocational/personal
                   development, (12) physiological measures of stress,
                   (13) other
                   Label of outcome measure
18-20              List code for outcome measure
21-23              Number of weeks post-therapy measure was taken
24                 Reactivity of measure: (1) low ... (5) high
25-26              Calculation of effect size: (1) mean difference over
                   control S.D., (2) MS within, (3) MS total minus
                   treatment, (4) probit, (5) chi square, (6) T table,
                   (7) mean and P, (8) nonparametrics, (9) correlations,
                   (10) raw data, (11) estimates, (12) other
27                 Source of means: (1) unadjusted post-test,
                   (2) covariance adjusted, (3) residual gains,
                   (4) pre-post differences, (5) other
28                 Significance of treatment effect: (0) -.001, (1) -.005,
                   (2) -.01, (3) -.05, (4) -.10, (5) .10, (6) .05,
                   (7) .01, (8) .005, (9) .001, (blank) not significant
29-34              Treatment group pre-mean
35-40              Treatment pre-standard deviation
41-46              Treatment post-mean
47-51              Treatment post-standard deviation
53-58              Comparison group pre-mean
59-64              Comparison pre-standard deviation
65-70              Comparison post-mean
71-76              Comparison post-standard deviation



Card three

Column    Value    Information

1-5                Study ID
6                  Running comparison number
7-8                Running measure number
9                  Running record number: punch 3 for card 3
10-13              T statistic
14-17              F statistic
18-22              Mean square within, residual, or [illegible]
23-24              Treatment group percentage improved
25-26              Comparison group percentage improved
27-30              Effect size
31                 Class of second therapy
32                 Superclass of second therapy
33                 Allegiance of E to second therapy
34                 Modality of second therapy
35                 Location of second therapy
36-38              Duration of second therapy in hours
39-41              Duration of second therapy in weeks
42-43              Number of therapists in second therapy
44-45              Experience of therapists in second therapy
46                 Other factorial effects tested: (0) none, (1) race,
                   (2) SES, (3) [illegible], (4) sex, (5) other
47                 Is this the last effect within this comparison:
                   (1) yes, (2) no
48-51              If yes, average effect size within this comparison
52                 Is this the last effect size in this study:
                   (1) yes, (2) no
53-56              If yes, average of all effect sizes in the study
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APPENDIX B 



STUDY USED AS CODING EXAMPLE IN CHAPTER FOUR

Appendix removed due to copyright restrictions. Material removed can be
obtained as:

Krumboltz, John D.; Thoresen, Carl E. The Effect of Behavioral
    Counseling in Group and Individual Settings on Information-
    Seeking Behavior. Journal of Counseling Psychology, v11 n4
    p324-33, 1964.
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